Foreign language teachers traditionally rely on their auditory-perceptual judgments to identify pronunciation problems, to decide on learning strategies, and to evaluate progress in pronunciation over time and with training. They typically compare the perceived stimulus to their own internal standards or scales. However, their perceptual ratings are contingent upon factors such as the task, the influence of grammatical or lexical errors, the specificity of the rating scales, the length of teacher/learner acquaintance, and the teachers' levels of fluency and training. All these factors raise the question of judgment reliability.

The waveform of an utterance can be measured: its components obey the laws of physics and can, therefore, be quantified. In acoustic phonetics, the analyst isolates specific cues to perform acoustic analyses on selected features of speech samples, e.g., vowels, consonants, and stress patterns. Such analyses include measures of amplitude (dB), of fundamental frequency (F0) or formant frequency variations measured in Hertz (Hz), and of duration. However, they require that the measurements and the protocol be strictly controlled. In addition, the selected cues may not necessarily be salient or significant correlates of an 'accent,' thus raising the question of validity.

This research measures the outcome of formal pronunciation training by comparing acoustic measurements to auditory-perceptual ratings. The speech samples consist of segmental and suprasegmental audio-recordings produced by forty-five female subjects, aged 18-45, at three dates approximately three months apart. The subjects were fifteen native speakers of French, who provided the normative baseline, twenty adult American students learning French, and ten experienced non-native speakers of French.

In the acoustic analyses, fifteen dependent variables are analyzed as potential carriers of foreign-accented pronunciation, using the Kay Elemetrics Computerized Speech Lab (CSL). The perceptual study involves six French native speakers who teach French but are not trained in teaching pronunciation. In each study, the speech stimuli are evaluated along three temporal and pedagogical parameters: (1) pre-/posttest, (2) longitudinal study, and (3) computerized audio-visual training.

The results show that the difference between Formant 2 and Formant 1 frequency of [y] and [u], total duration of utterance, voice onset time (VOT) of syllable-initial [p], and syllable duration are carriers of an English accent in French and correlate with perceived levels of foreign accentedness.
CHAPTER 1
INTRODUCTION

Statement of Purpose

The objective of the present study is to quantify the outcome of formal instruction in French pronunciation (i.e., progress over time and with training) on the speech of adult women who are speakers of American English learning French. Foreign accented pronunciation may be broadly defined as the faulty production of a target language (L2), due to faulty perception, faulty articulation, or a combination of both. The nature of the infelicitous production of L2 sounds is grounded in the degree of incongruity between the two phonologies in contact: the native language (L1) and L2. Foreign accents may occur at both the segmental and suprasegmental levels of speech. They typically arise when a speaker superimposes the morpho-phonological rules of his/her native language (L1) onto the sound system of the target language (L2), thereby altering the expected pronunciation patterns of that language. The acquisition and monitoring of the L2 sound system may be influenced by a wide variety of factors, including both internal and societal constraints, the study of which has spawned dozens of theories and models. The multiplicity of variables of interest is reflected in the philosophy and methodology used to assess progress in L2 pronunciation.
The assessment of speech lies at the intersection between the speaker's encoded production and the listener's perceptual decoding process. The robustness of speech sounds is predicated on articulatory behavior, acoustic signal production, and linguistic knowledge. When listeners perceive a foreign accented pronunciation, they perceive deviations from and/or distortions of their individual, subjective, and internalized standards. Are these individual and subjective standards measurable? Can they be converted into objective and reliable measures of foreign accent? Put another way, can improvement in foreign accented pronunciation be quantified? Two methods of evaluation will therefore be examined: traditional auditory judgment and an acoustically determined assessment of progress in pronunciation in the formal setting.

Problems and Questions

Despite the number of qualitative and quantitative studies performed on the acquisition of an L2 sound system, the question of how to objectively assess progress and change over time in accented pronunciation in the formal classroom remains unanswered. Using both segmental and suprasegmental features, the present study aims to answer the following questions:

1. Can improvement in pronunciation with formal training be acoustically quantified relative to the native speaker (NS) target baseline?

2. Can the effect of technology (Visipitch II) used in formal pronunciation training be acoustically quantified?

3. Can long-term retention in foreign pronunciation be quantified?

4. Do the selected acoustic cues (formant 1 and formant 2 frequency of [y]/[u], formant 2 frequency transition of [ɥi] and [wi], total duration of utterance, Voice Onset Time of syllable-initial [p] and [t], stressed vs. unstressed syllable duration, fundamental frequency variation, and formant 1 of unstressed [a]) correlate with auditory-perceptual ratings?
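Question 4 asks whether an acoustic cue varies together with listeners' ratings. The following is a minimal sketch of one way to quantify that. It assumes a Pearson product-moment correlation computed over per-speaker values. This section does not specify the statistic the study actually used, and a rank-based coefficient such as Spearman's may suit ordinal accentedness ratings better. The function name and all numbers below are hypothetical and were invented for illustration only.

```python
import math

def pearson_r(xs, ys):
    """Pearson product-moment correlation between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical per-speaker values (NOT study data):
# VOT of syllable-initial [p] in ms, and mean perceived accentedness
# on an assumed scale where 1 = native-like and 5 = strong accent.
vot_ms = [12, 18, 25, 41, 55, 63]
accent_rating = [1.2, 1.5, 2.4, 3.1, 4.0, 4.6]
r = pearson_r(vot_ms, accent_rating)
print(round(r, 3))
```

A coefficient near +1 would mean that longer VOT goes with stronger perceived accent. A coefficient near 0 would mean there is no linear relation between the cue and the ratings.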
Organization of the Study

The present study is organized in five chapters. The first defines the objective of the research and states the research questions to be answered. Chapter 2 provides a review of the literature pertinent to the study. It is divided into three sections. The first section describes the articulatory characteristics of the English and French sound systems, with a focus on four segmental features, tu [ty] 'you' vs. tout [tu] 'everything' and suite [sɥit] vs. oui [wi] 'yes', and on the phrase fais attention de ne pas glisser [fe(z)atɑ̃sjɔ̃dœnœpaglise] 'be careful not to slip', selected from a contextualized dialogue in French. The second section delineates some components of transferability in the acquisition of an L2 sound system; only those pertinent to the present research are examined. The third underscores the advantages and disadvantages of auditory-perceptual vs. acoustic assessment of foreign accented speech.

Chapter 3 describes the methodology and the design of the study. The research is grounded in two interrelated studies. For both, the speech stimuli consist of audio-recordings, made in optimum conditions and produced by forty-five female subjects, aged 18-45, at three dates approximately three months apart. In the acoustic study, the accent of thirty non-native speakers of French is assessed against fifteen native speakers of French. The perceptual study involves six French native speakers who are asked to listen to the randomized stimuli and rank them by category of accentedness. In each experiment, performance and progress are evaluated along three temporal and pedagogical parameters: (1) pre-/posttest, (2) longitudinal test, and (3) effect of audio-visual training.

Chapter 4 presents the results of the descriptive and inferential statistical analyses run on the acoustic measurements and the auditory-perceptual ratings assigned to both the segmental and the suprasegmental features. The means and mean differences between the experimental groups indicate that progress over training in pronunciation, training on the Visipitch, and long-term retention can be measured. Examination of the data shows consistent patterns of improvement in pronunciation along the temporal cues; training had an effect on the L2 learners' output, an effect that was quantified by numerical changes in the acoustic signals. Conversely, training had no effect on the production of [u] by non-native speaker (NNS) learners. The results of the study further suggest that ten of the fifteen selected acoustic cues correlate with perceived degrees of a foreign American-English accent in French.
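Measuring progress "relative to the native (NS) target baseline" can be sketched as follows. This sketch assumes that progress is expressed as the reduction in a learner's distance from the NS mean, scaled by the NS standard deviation. It is an illustration under that assumption, not the study's analysis, and the function names and values are hypothetical.

```python
from statistics import mean, stdev

def baseline_distance(value, ns_values):
    """Absolute distance of a learner's value from the native-speaker (NS) mean,
    in units of the NS standard deviation (an absolute z-score)."""
    return abs(value - mean(ns_values)) / stdev(ns_values)

def progress(pre, post, ns_values):
    """Positive when the posttest value lies closer to the NS baseline
    than the pretest value; negative when it has moved away."""
    return baseline_distance(pre, ns_values) - baseline_distance(post, ns_values)

# Hypothetical VOT values (ms) for syllable-initial [p]; invented for illustration.
ns_vot = [14, 16, 18, 15, 17]       # native-speaker baseline
learner_pre, learner_post = 48, 30  # one learner at pretest and posttest
print(round(progress(learner_pre, learner_post, ns_vot), 2))
```

Scaling by the NS standard deviation puts cues measured in different units (ms, Hz) on a comparable footing. A raw difference in ms or Hz would work for a single cue.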

Chapter 5 discusses the results obtained from the present experiment, highlights the limitations of this study, and suggests some implications and directions for further research.


CHAPTER 2
LITERATURE REVIEW

Atypical speech, be it pathologically disordered or foreign accented pronunciation, has been investigated for decades in a wide variety of fields. In this review, three areas will be specifically examined: the phenomenon of infelicitous pronunciation within the framework of articulatory phonetics, some parameters affecting the acquisition of a target sound system in the formal setting, and the assessment of foreign accented pronunciation. The paradigms of these interrelated areas will be applied to the segmental and suprasegmental features selected for this research: [ty] in tu 'you', [tu] in tout 'everything', [wi] in oui 'yes', [ɥi] in suite, and [fe(z)atɑ̃sjɔ̃dœnœpaglise] fais attention de ne pas glisser 'be careful not to slip'. The segmental features were selected for the phonemic and allophonic contrasts they exhibit in French and the resulting problems they pose for speakers of English learning French. The phrase was selected to investigate progress in the acquisition of French prosody.

Articulatory Phonetics

Segmental Features

Segmental errors consist of mispronouncing vowels, glides, liquids, and consonants. French has thirteen vowels, including the nasals, and English has fourteen, including the glides (Delattre, 1996; Denes & Pinson, 1993). For the vowels, French and English exhibit opposite articulatory habits, as illustrated in Table 2.1 (e.g., Argod-Dutard, 1997; Casagrande, 1984; Casagrande & Casagrande, 1996; Delattre, 1965, 1966).
Table 2.1

Main Trends between the French and English Vowel Systems

LIPS
  French:  • strong tendency toward fronting and rounding
  English: • no front rounding

TENSION
  French:  • high articulatory effort; stability throughout the production of the vowel -> clear phonetic distinctness
           • e.g., capitAle is clearly differentiated from capitOle
  English: • low articulatory effort; tendency towards muscular relaxation; continuous movement throughout the production (gliding)
           • unstressed vowels tend to be reduced, e.g., capitAl is indistinct from capitOl

VELUM
  French:  • clearly raised or lowered
  English: • NOT clearly raised/lowered

Vowels [y] and [u]

The vowels [y] and [u] have been selected because of their different distribution in French and English. In French, [y] and [u] are phonemically contrastive: bu [by] 'drunk' vs. bout [bu] 'end'. English, on the other hand, while including [u], does not contain [y] in its repertoire; its phonetic realization by English speakers is, therefore, predicted to be problematic. Contrastive French [y] and [u] are defined according to the three conventional articulatory parameters: tongue height, tongue anterior/posterior position, and lip protrusion or rounding (Catford, 1988; Kent & Read, 1992).

The phonetic features of [y] are [+high, +front, +round], while those of [u] are [+high, -front, +round]. They contrast in a single point of articulation, the anterior/posterior position of the tongue: [y] is palatal, whereas [u] is velar. For [y], the tip of the tongue presses against the lower incisors, while the tongue dorsum is pressed against the hard palate. The tongue is, therefore, high up in the mouth and the incisor separation is small. For [u], the position of the tongue tip varies and the body of the tongue is near the velum; the point of constriction is located in the velar area.

The third parameter, lip rounding, is usually the most visible. To produce a French [y] starting from [i], the lips are deliberately rounded while the tongue position of [i] is rigidly maintained. Lip rounding is common to both [y] and [u]; however, its phonetic realization differs. Catford (1988) claims that:

"There is normally a close correlation between tongue-height and degree of rounding: the closer the vowel the smaller the labial aperture and vice versa [...] [y] and [u] have very close rounding - a very small aperture barely the diameter of a pencil [...]. The form of the lip-rounding often differs according to whether it is applied to back vowels or to front vowels." For [u], "the corner of the lips are pushed in towards the center so that both lips are pushed forwards, or 'pouted'. They form a kind of small tunnel in front of the mouth [...] the cheeks are pulled inwards, and also the channel between the lips is formed by the inner surface of the lips, hence 'inner rounding' or 'endolabial'" (150-51).

For the second type of rounding, Catford (1988) recommends vertically compressing the corners of the mouth, leaving a small central channel between the lips that has a slit-like, flat elliptical shape rather than an actually round one. This type of 'outer rounding' or 'exolabial' involves the outer surface of the lips, as for front rounded vowels like [y] and [ø].
Although the phoneme /u/ belongs to both the English and the French sound systems, two problems have been documented. The first is that in open syllables, English NSs tend to end the vowel with a non-phonemic glide [w], whereas French NSs merely lengthen it. The second problem is grounded in the anterior/posterior position of the tongue, i.e., the point of constriction. The important difference is the varying distance between the body of the tongue and the palate, since that is what largely determines the size and shape of the oral cavity. Phoneticians have described French [u] as not only more tense but also less anterior than its English counterpart (Delattre, 1953, and Valdman, 1976, cited in Flege & Hillenbrand, 1987). When using "[...] a closely rounded version of the oo in too as your [u], you will probably notice that the tongue doesn't move back very far in going from [y] to [u]. This is because in most varieties of English the oo vowel is not a very back vowel" (Catford, 1988: 127); i.e., it is more central than standard French [u]. Viewed from another perspective, this central [u] may be described as an American [y] produced with less protrusion, with the dorsum of the tongue lower than in French and moving more towards the back, with the tip of the tongue free, and with some gliding (Casagrande, 1984; Delattre, 1966).
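The acoustic counterpart of the front/back contrast described above is F2. Fronting the tongue raises F2, so a more central [u] should show a larger F2−F1 difference than a back French [u]. The sketch below compares a learner's [u] with a native reference by mean F2−F1. It assumes that F1 and F2 (Hz) have already been measured per token, for example with an analysis system such as the CSL. The formant values and the function name are hypothetical and were chosen only for illustration.

```python
from statistics import mean

def mean_f2_minus_f1(tokens):
    """tokens: list of (F1_hz, F2_hz) pairs for one vowel.
    Returns the mean F2 - F1 in Hz; larger values indicate a more front articulation."""
    return mean(f2 - f1 for f1, f2 in tokens)

# Hypothetical formant measurements (Hz), invented for illustration.
native_u = [(300, 750), (310, 780), (290, 760)]      # back French-like [u]
learner_u = [(320, 1150), (330, 1250), (310, 1200)]  # centralized [u]

native_diff = mean_f2_minus_f1(native_u)
learner_diff = mean_f2_minus_f1(learner_u)
print(native_diff, learner_diff, learner_diff > native_diff)
```

The same measure applied to [y] tokens would show how far apart a speaker keeps the two rounded high vowels.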

Diphthongs [ɥi] and [wi]

In the literature, the terms semivowels, diphthongs, glides, and approximants are used interchangeably. Gliding describes the gradual movement of articulation, starting with one tongue and lip position but ending with another. The feature approximant refers to a clear narrowing of the vocal tract at some point of the production that is neither closed nor close enough to cause frication. The term semivowel signifies that these segments are vowel-like in their voicing characteristics, but that they require the presence of a following vowel to construct a syllable; in other words, semivowels are [-syllabic]. The diphthong consists of a semivowel plus the full vowel it needs to become syllabic, e.g., [w] + [i] -> [wi]. For the purpose of this research, the term glide will be used (Casagrande, 1984; Kent & Read, 1992).

The French glides [ɥ, w], like [y, u], are phonemically contrastive: [lɥi] lui 'him' and muette [mɥɛt] 'mute' contrast respectively with [lwi] Louis and [mwɛt] mouette 'seagull'. Phonetically, [ɥ] is characterized as a labial-palatal glide, while [w] is characterized as labial-velar. The distinction bears on the area of narrowing created by the tongue position, while the labial narrowing remains constant (Carr, 1993; Casagrande & Casagrande, 1996). The syllable [ɥi] starts with the features [+front, +high, +round] of [y] and ends with the features [+front, +high, -round]. On the other hand, [wi] starts with the features [-front, +high, +round] of [u] and ends with [+front, +high, -round]. Like [y] and [u], the two semivowels [ɥ] and [w] contrast in one single feature, [front]. As [ɥ] is absent from the English repertoire, learners attempting to produce it may erroneously produce [u] or [w] instead of [y] as the initial point of articulation.
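Because [w] is back and [ɥ] is front, the F2 trajectory into [i] should differ between the two glides. In [wi], F2 rises from a low onset value, while in [ɥi] it starts already high, so the transition is smaller. The sketch below measures the extent and mean rate of the transition. It assumes F2 values (Hz) read at the glide onset and at the vowel target, plus the transition duration. The numbers and names are hypothetical. A learner who substitutes [u] or [w] for the onset of [ɥi] would be expected to show a [wi]-like extent.

```python
def f2_transition(f2_onset_hz, f2_offset_hz, duration_ms):
    """Return (extent in Hz, mean rate in Hz/ms) of an F2 transition
    from the glide onset to the vowel target."""
    if duration_ms <= 0:
        raise ValueError("duration must be positive")
    extent = f2_offset_hz - f2_onset_hz
    return extent, extent / duration_ms

# Hypothetical values, invented for illustration.
wi_tr = f2_transition(800, 2300, 60)   # [wi]: back onset, large rise
ui_tr = f2_transition(1900, 2300, 60)  # [ɥi]: front onset, small rise
print(wi_tr, ui_tr)
```

Reporting the rate alongside the extent keeps a slow but complete transition distinct from a fast, truncated one.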

Initial voiceless obstruents: [p] vs. [pʰ] and [t] vs. [tʰ]

The voiceless stops [p]-[pʰ] and [t]-[tʰ] belong to both sound systems and are not phonemically contrastive in either system. The laryngeal setting for voicelessness refers to the absence of vocal fold vibration, and aspiration refers to a greater rate of airflow between the burst release and the onset of vocal fold vibration for the following vowel. This in turn translates into a longer duration of voicelessness (Kent & Read, 1992; Ladefoged & Maddieson, 1996). However, the highly incongruent allophonic distribution of these phonemes leads English speakers to mispronounce them in a French linguistic context. In English, initial /p t k/ are realized as voiceless aspirated stops with long-lag voice onset time (VOT), whereas French word- and syllable-initial /p t k/ are realized as voiceless unaspirated stops with short-lag VOT values, i.e., the release of the closure and the onset of voicing are nearly simultaneous. In final position, the pattern is reversed: French /p t k/ are aspirated, whereas in English these final stops are unaspirated (Casagrande & Casagrande, 1996; Flege, 1987; Kent & Read, 1992). This incongruent allophonic distribution is expected to lead English speakers to make errors when producing these phonemes in the French linguistic context under study, i.e., /p/ in pas and /t/ in attention.
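VOT is the interval from the burst release to the onset of voicing for the following vowel. The sketch below computes it from two time stamps and labels the result as short-lag or long-lag. The 30 ms boundary is an assumed, illustrative cutoff and is not a value from this study. The time stamps and names are hypothetical.

```python
SHORT_LAG_MAX_MS = 30.0  # assumed boundary between short-lag and long-lag VOT

def vot_ms(burst_s, voicing_onset_s):
    """VOT in milliseconds: voicing onset minus burst release (times in seconds)."""
    return (voicing_onset_s - burst_s) * 1000.0

def lag_category(vot):
    """Classify a VOT value; a negative VOT means voicing starts before the release."""
    if vot < 0:
        return "prevoiced"
    return "short-lag" if vot <= SHORT_LAG_MAX_MS else "long-lag"

# Hypothetical time stamps (s) for [p] in "pas", invented for illustration.
french_like = vot_ms(1.200, 1.215)   # about 15 ms
english_like = vot_ms(1.200, 1.265)  # about 65 ms
print(lag_category(french_like), lag_category(english_like))
```

Keeping the raw VOT value, and not only the category label, preserves the gradual changes that pre-/posttest comparisons are meant to detect.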

Prosodic Features

Prosody has been defined as the "fabric of speech, within which segments are the individual stitches or fibers" (Kent & Read, 1992: 152) and is responsible for essential functions in communication; prosody encompasses properties such as stress, rhythm, speaking rate, and intonation. Variations in prosody are traditionally perceived in terms of loudness, pitch, and duration. Listeners rely on prosody to process and segment the stretch of discourse, establish syntactic groups, and interpret the content:
(1) Picasso a peint / l'enfant en bleu

'Picasso painted / the child in blue'

(2) Picasso a peint l'enfant / en bleu

'Picasso painted the child / in blue'

In the first example, l'enfant en bleu is a constituent functioning as the direct object, whereas in the second example, only l'enfant is the direct object. Both stress and pauses help the listener disambiguate the message.

The term suprasegmental denotes the decoding of longer stretches of utterance, hierarchically constructed and constrained from syllables to feet, words, phonological words, rhythmic or breath groups, and sentences, modulated by superimposed stress and intonation (Catford, 1988; Delattre, 1966; Kent & Read, 1992). Suprasegmental errors can occur in timing, rhythm, intonation, and stress (Anderson-Hsieh, 1992).

The following section defines the main components of prosody and reviews them in light of English and French constraints, focusing on how stress patterns and rhythm can affect foreign accented pronunciation, i.e., the infelicitous presence of English patterns in French.

Stress. Stress highlights the relative prominence or salience of one syllable relative to the adjacent ones. Stressed syllables, applied at the lexical level, are usually recognized by greater loudness, longer duration, and/or higher pitch, i.e., faster vocal fold vibration. The alternation of stressed and unstressed syllables therefore contributes to the construction of the prosody of a language. Stress may indicate grammatical category (e.g., con'tent vs. 'content), a correlation between syntactic and semantic structures, or emphatic or contrastive intention (Delais, 1994; Hayes, 1995; Martin, 1987; Pasdeloup, 1992; Wenk & Wioland, 1982).

Stress assignment in French and English varies radically. On the one hand, French has a primary lexical stress on the last full syllable of a word, which then undergoes temporal lengthening. This categorizes French stress assignment as right-headed: the head of the foot is its rightmost syllable. This main stress, denoted ['], is fixed and broadly applied: pho'to, photo'graphe, photogra'phique, photogra'phier. In polysyllabic utterances, a weaker secondary stress, denoted [,], may also occur to the left of the main grammatical stress, as in an,ticonstitutio'nnel (Delattre, 1966; Vaissiere, 1991).
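The French rule above places primary stress on the last full syllable, not on a final schwa. It is simple enough to state as a procedure. The sketch below assumes that words are already split into orthographic syllables, each flagged for whether its vowel is a schwa. It does not model syllabification itself, and the helper names are hypothetical.

```python
def primary_stress_index(syllables):
    """syllables: list of (text, is_schwa) pairs.
    Return the index of the last syllable whose vowel is not a schwa."""
    for i in range(len(syllables) - 1, -1, -1):
        if not syllables[i][1]:
            return i
    raise ValueError("no full vowel in word")

def mark_stress(syllables):
    """Spell the word with ['] before the stressed syllable, as in pho'to."""
    k = primary_stress_index(syllables)
    return "".join(("'" + text if i == k else text)
                   for i, (text, _) in enumerate(syllables))

words = [
    [("pho", False), ("to", False)],
    [("pho", False), ("to", False), ("gra", False), ("phe", True)],
    [("pho", False), ("to", False), ("gra", False), ("phi", False), ("que", True)],
    [("pho", False), ("to", False), ("gra", False), ("phier", False)],
]
for w in words:
    print(mark_stress(w))  # pho'to, photo'graphe, photogra'phique, photogra'phier
```

The output reproduces the four examples in the text. The stress shifts rightward as suffixes are added, but it always stops before a final schwa.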

On the other hand, in English, lexical stress assignment is free, that is, variable although predictable. It can be assigned to the first syllable, as in 'Scotland, 'lexical, 'tendency. It can be assigned to the second syllable, as in A'merica, psy'chology. It can also be assigned to the third or antepenultimate syllable, as in eco'nomical, anticonsti'tutional. In both cases, syllables are counted from the right. English stress assignment responds to three criteria: stress is sensitive to the quantity of the syllable, it is left-headed, and it is presumably unbounded at the level of the phrase. In other words, a foot could contain an unrestricted number of 'trimmed' unstressed syllables, e.g., 'Sandra 'flirted with the delectable 'Frenchman (Carr, 1993: 217). Stress falls on a heavy syllable at the foot level, since the formation of a foot is predicated on the weight or quantity of the syllable. In polysyllabic words, a secondary stress may be found to the left of the main stress, as in a,mericani'zation, and in compound words it may be found on the right, as in 'teachers, husband; this contrasts with French, which has secondary stresses only on the left.

Furthermore, just as they differ in lexical stress assignment, the two languages also differ in the realization of unstressed syllables in connected speech. In French, although the primary stress is shifted or demoted as a result of the concatenation of segments, the quality of the vowels remains unaltered, i.e., the vowels are not neutralized. There is one significant exception to this general rule: schwa. Schwa is inherently stressless except in the imperative, as in dis-le 'say it' (vs. vous le dites 'you say it'). In unstressed position, schwa can be produced either as [ø] or [œ], as in ville de Paris [vildœpari], or it can be deleted, as in doigt de pied [dwadpje]. Schwa can also be inserted (epenthesis), as in ours blanc [ursœblɑ̃]. The phonological rules that govern the release, insertion, or deletion of schwa are irrelevant to this study. For the present purpose, it suffices to know that schwa typically deletes whenever possible, contingent upon the speaking rate, the style, and language constraints. The degenerate foot resulting from schwa deletion triggers resyllabification with the adjacent consonant. For instance, it feeds voicing assimilation, which alters the prosodic parameters of duration and loudness, as in penses-tu [pɑ̃sœty] -> [pɑ̃sty] and pense-bête [pɑ̃sœbɛt] vs. [pɑ̃zbɛt] (Casagrande, 1984; Tranel, 1987).
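The interaction of schwa deletion and voicing assimilation in pense-bête can be sketched as a rewrite on a list of segments. The sketch rests on three simplifying assumptions. First, schwa is written [œ], as in the transcriptions above. Second, deletion is applied unconditionally, whereas in speech it depends on rate, style, and language constraints. Third, only regressive voicing of a voiceless obstruent before a voiced obstruent is modeled. The mapping table and function name are hypothetical.

```python
SCHWA = "œ"  # schwa as transcribed in this section
# Voiceless obstruent -> voiced counterpart (regressive voicing only).
VOICED = {"p": "b", "t": "d", "k": "g", "f": "v", "s": "z", "ʃ": "ʒ"}
VOICED_OBSTRUENTS = set(VOICED.values())

def delete_schwa(segments, i):
    """Delete the schwa at index i, then voice a voiceless obstruent
    that has become adjacent to a following voiced obstruent."""
    if segments[i] != SCHWA:
        raise ValueError(f"segment {i} is not schwa")
    out = segments[:i] + segments[i + 1:]
    # After deletion, out[i - 1] and out[i] are the newly adjacent segments.
    if 0 < i < len(out) and out[i - 1] in VOICED and out[i] in VOICED_OBSTRUENTS:
        out[i - 1] = VOICED[out[i - 1]]
    return out

pense_bete = ["p", "ɑ̃", "s", "œ", "b", "ɛ", "t"]
penses_tu = ["p", "ɑ̃", "s", "œ", "t", "y"]
print("".join(delete_schwa(pense_bete, 3)))  # [s] voiced to [z] before [b]
print("".join(delete_schwa(penses_tu, 3)))   # [s] unchanged before voiceless [t]
```

Treating the nasal vowel as a single list element keeps the segment indices aligned with the transcription.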

In English, at the phrasal and sentential levels, stress shifting is constrained (Goldsmith, 1990; Hayes, 1995). The concatenation of segments triggers the systematic reduction of unstressed vowels contained between stressed syllables, i.e., between feet. For instance, in 'Sandra 'flirted with the de'lectable 'Frenchman (Carr, 1993: 217), all unstressed vowels are neutralized to schwa.

Rhythm. The Greek rheō (ῥέω) 'to flow' suggests that rhythm implies harmony within movements, that is, repetition or periodicity between stressed and unstressed syllables; rhythm refers to the timing mechanisms of languages (Fraisse, 1956). Languages are broadly divided between two types of rhythm, stress-timed and syllable-timed, although Anderson-Hsieh (1992: 52) claims that research "has not supported a simple categorization of languages into these two types of rhythm." At most, tendencies toward one system or the other can be recognized.

English rhythm is categorized as stress-timed, with "isochronous feet" (Vaissiere, 1991: 109), i.e., stressed syllables occur at fairly regular intervals regardless of the number of unstressed syllables "bunched up together, so that the stresses remain equidistant from each other" (Carr, 1993: 217). This tendency toward isochronous rhythm, that is, an even tempo or interval between stresses, triggers the deletion and/or shift of lexical stresses as well as the reduction and neutralization of unstressed vowels, as mentioned earlier.

Conversely, French rhythm is categorized as syllable-timed, with "isochronous syllables" (Vaissiere, 1991: 109). French speech is composed of feet, which themselves consist of one or several equally weighted syllables (rarely more than six). The last syllable is marked by a slightly stronger stress and therefore shows prominence over the preceding syllables. Different phrases containing an equal number of syllables take approximately the same amount of time to utter, provided the speaker's pace remains constant (Casagrande & Casagrande, 1996; Delattre, 1966). However, Wenk and Wioland (1982) claim that French is, at best, regulated group-finally, which is not necessarily synonymous with syllables of equal length. Their argument is that group-final syllables are nearly twice as long as non-final syllables. However, the distribution of groups within a sentence has not been uniquely defined (Delais, 1994; Pasdeloup, 1992).
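Wenk and Wioland's observation can be checked on measured durations by comparing a group-final syllable with the mean of the non-final syllables in the same group. The sketch below assumes that syllable durations (ms) for one rhythmic group are already segmented. The durations and function name are hypothetical. A ratio near 2 would match their claim that group-final syllables are nearly twice as long.

```python
def final_lengthening_ratio(durations_ms):
    """Duration of the group-final syllable divided by the mean duration
    of the preceding (non-final) syllables in the same group."""
    if len(durations_ms) < 2:
        raise ValueError("need at least two syllables")
    *non_final, final = durations_ms
    return final / (sum(non_final) / len(non_final))

# Hypothetical durations for the group "fais attention" (fais, a, tten, tion); invented values.
print(final_lengthening_ratio([150, 140, 160, 300]))  # 2.0
```

A ratio close to 1 across all syllables would instead fit the strict syllable-timing view.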

Although the parameters that govern their respective rhythms differ, both French and English appear to conform to proposals in the research literature for formalizing eurhythmy across languages. Both seem to support a preference for rhythmic symmetry, i.e., eurhythmy. However, English would express it through dominant accentuation, whereas French prefers temporal organization. The two languages also tend to exhibit similar rise-fall/high-low contours in declarative sentences. A rising pitch or stress denotes the beginning of an utterance as well as its uncompleted or open-ended condition. A falling pitch denotes the end of the utterance (Pickering, 1994; Vaissiere, 1991, 1995; Wenk & Wioland, 1982). The intonation contour for command forms in French, i.e., the imperative mode, follows a high-low contour. However, a low-high stress pattern can also be expected to denote impatience or anxiety (Valdman, 1993).

Predictions. Based on similarities and contrasts between the French and the English prosodic systems, the following hypotheses are suggested concerning the production of the phrase fais attention de ne pas glisser [fe(z)atɑ̃sjɔ̃dœnœpaglise] 'be careful not to slip' by both French speakers and American-English speakers learning French (Tables 2.2-2.3):

• Voice Onset Time (VOT). English speakers are expected to produce word-initial voiceless [p] in pas and syllable-initial [t] in attention with aspiration, i.e., with longer-lag VOT.

• Stressed syllables. French speakers will put a stress on the following syllables of fais attention de ne pas glisser: fais, -tion, pas, and -sser. English speakers are expected to add an off-glide to the stressed vowels in [fɛ] and [se], i.e., [fej] and [sej]. The stress and higher pitch on 'fais' derive from the imperative mood. The stress on 'pas' is emphatic. The stresses on '-tion' [sjɔ̃] and '-sser' [se] mark the end of a foot and the end of the phrase, respectively. Within the French prosodic system, these syllables are expected to be longer. American-English speakers are expected to raise the pitch on [pa], to increase duration further, and to produce a glide in both [sjɔ̃] and [se].

• Unstressed syllables. While unstressed [a] in attention retains its full vowel quality in French, American-English speakers may reduce it to schwa. In French, the unstressed '-tten' [tɑ̃] is expected to be short, with low pitch; American speakers, conversely, would mark it with lexical stress, giving it longer duration and higher pitch.

• Rhythm. Infelicitous production may stem from disharmony at the rhythmic level. Instead of the even tempo of the French pattern, with near-equal intervals of 4 and 5 (or 4) syllables (1 2 3 4 - 1 2 3 4 5(4)), the second scenario yields intervals of 3, 4, and 1 syllables (1 2 3 - 1 2 3 4 - 1). In other words, the intervals between stressed syllables would reflect the English lexical stress on [tɑ̃] and [gli]. In this case, the stressed syllables are no longer evenly paced: eurhythmy is broken.
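The VOT prediction can be made operational by sorting measured VOT values into lag categories. The sketch below is a minimal Python illustration, not the procedure used in this study: the function names are hypothetical, and the 30 ms boundary between short-lag and long-lag VOT is an assumed cut-off chosen only for illustration.

```python
from typing import Sequence

# ASSUMPTION: hypothetical short-lag/long-lag boundary, for illustration only;
# it is not a value reported in this study.
LONG_LAG_THRESHOLD_MS = 30.0


def vot_category(vot_ms: float, threshold_ms: float = LONG_LAG_THRESHOLD_MS) -> str:
    """Label one VOT measurement (in ms) as voicing lead, short lag, or long lag."""
    if vot_ms < 0:
        return "lead"        # voicing begins before the stop release
    if vot_ms < threshold_ms:
        return "short-lag"   # unaspirated voiceless stop
    return "long-lag"        # aspirated voiceless stop


def long_lag_share(vots_ms: Sequence[float],
                   threshold_ms: float = LONG_LAG_THRESHOLD_MS) -> float:
    """Proportion of tokens produced with long-lag (aspirated) VOT."""
    if not vots_ms:
        raise ValueError("no VOT measurements supplied")
    long_lag = sum(1 for v in vots_ms if vot_category(v, threshold_ms) == "long-lag")
    return long_lag / len(vots_ms)


# Invented measurements for [p] in "pas" (not study data).
tokens = [12.0, 18.5, 65.0, 80.2]
print([vot_category(v) for v in tokens])  # ['short-lag', 'short-lag', 'long-lag', 'long-lag']
print(long_lag_share(tokens))             # 0.5
```

Under the prediction, the learners' tokens of [p] and [t] would be expected to show a higher long-lag share than the native speakers' tokens.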
Table 2.2

Prosodic realization by native speakers of French

Syllables           fais   a      tten   tion   de     ne     pas    gli    sser
IPA transcription   fɛ     (z)a   tɑ̃     sjɔ̃    dœ     n(œ)   pa     gli    se
Pitch variations    +                    +                    +             +
Stress              (+)                  +                    +             +
Interval                                 4                                  5(4)

Table 2.3

Prosodic realization by American-English speakers learning French

Syllables           fais   a      tten   tion   de     ne     pas    gli    sser
IPA transcription   fej    (z)ə   tʰɑ̃    sjɔ̃    dœ     nœ     pʰa    glij   sej
Pitch variations    +             +                           +      +
Stress              (+)           +                           +      +
Interval                          3                           4      1
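The Interval rows of Tables 2.2 and 2.3 count the syllables from the start of the phrase, or from the previous interval-closing stress, up to and including the next one. A minimal Python sketch of that count follows, assuming the interval-closing syllables are those under which the tables report an interval (-tion and -sser for French; -tten, pas, and gli for the learner scenario). The function names and the "unevenness" score (longest minus shortest interval) are hypothetical illustrations, not measures used in this study.

```python
from typing import List, Sequence, Set


def stress_intervals(syllables: Sequence[str], closing: Set[int]) -> List[int]:
    """Count syllables in each interval that ends on an interval-closing stress.

    Syllables after the last closing stress are not counted.
    """
    intervals = []
    count = 0
    for index in range(len(syllables)):
        count += 1
        if index in closing:
            intervals.append(count)
            count = 0
    return intervals


def unevenness(intervals: Sequence[int]) -> int:
    """Hypothetical score: 0 means perfectly even intervals."""
    return max(intervals) - min(intervals)


SYLLABLES = ["fais", "a", "tten", "tion", "de", "ne", "pas", "gli", "sser"]
french = stress_intervals(SYLLABLES, {3, 8})      # closes on -tion and -sser
learner = stress_intervals(SYLLABLES, {2, 6, 7})  # closes on -tten, pas, gli

# With the schwa of "ne" elided, [n] joins "pas" and the phrase has 8 syllables.
ELIDED = ["fais", "a", "tten", "tion", "de", "(n)pas", "gli", "sser"]
french_elided = stress_intervals(ELIDED, {3, 7})

print(french, unevenness(french))                # [4, 5] 1
print(learner, unevenness(learner))              # [3, 4, 1] 3
print(french_elided, unevenness(french_elided))  # [4, 4] 0
```

The spread score is deliberately crude; it only makes visible that the learner scenario departs further from even pacing than either French realization.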

Having briefly compared the phonetic characteristics of the speech elements under investigation, this chapter turns, in the next two sections, to some factors affecting the monitoring of the acquisition of a target sound system and to two methods for evaluating progress over time and training in the formal setting.
Foreign Accented Pronunciation in Second Language Acquisition (SLA)
Between the native language (L1) and the acquisition of the target language (L2) lies the domain of interlanguage (IL). Research in IL phonology addresses the pervasive questions of how adults acquire L2 speech and what degree of ultimate attainment they can reach (Ioup, 1987; Tarone, 1987; Young-Scholten, 1995). Studies have shown that negative transfer from L1 to L2 is insufficient to explain infelicitous L2 speech. Transfer has since been redefined as a cross-linguistic influence that encompasses a variety of constraints wider than the L1 alone. SLA research on the problems adults encounter when learning the L2 sound system has spawned dozens of models and theories (Celce-Murcia, 1991; Ellis, 1994; Odlin, 1993; Tarone, in Ioup, 1987). In this section, the monitoring of progress in adult L2 speech will be discussed within the following framework: (1) paradigms of perceptual classification and L2 phonetic mapping and (2) the effect of formal instruction and technology on L2 speech acquisition. These theories will be applied to the data selected for the present research.

The  Perceptual  Classification  and  L2  Phonetic  Mapping 

Studies by Flege (1981) and Ziolkowski et al. (1992) suggested the presence of two phonologies in the L2 acquisition process: learners were assumed to use one phonology for perception and another for production, with specific training techniques affecting the relationship between the two. L2 perception is defined as the identification, decoding, and interpretation of speech signals. There is abundant evidence that beginning learners make perceptual reference to native-language (L1) phonetic categories, using them to filter L2 sounds. Studies show that, in general, adult L2 learners have difficulty perceiving L2 phonetic contrasts that are not functional (non-phonemic) in their L1. Rochet (1995) showed that English native speakers (NSs) assign both French /y/ and /u/ to their single category [u], while French NSs assimilate English [θ] to either [f] or [s]. However, the degree of this difficulty depends on factors such as age, kinaesthetic feedback (i.e., the perception of one's own movement), motivation, exposure to or training in the L2, and frequency of L2 use (Beddor & Gottfried, 1995; Borden & Harris, 1984; Flege, 1987, 1992; Leather & James, 1996).

In his Speech Learning Model (SLM), Flege (1995) claims that "without accurate perceptual 'targets' to guide the sensorimotor learning of L2 sounds, production of L2 sounds will be inaccurate. [...]. L2 production errors have a perceptual basis [...]. The production corresponds to the properties represented in its phonetic category representation" (238-39). L2 learners use a system of equivalence classification to interpret and categorize L2 sounds as new, similar, or identical. The model predicts that L2 learners will perceive new sounds more successfully than similar ones, because new sounds differ noticeably from L1 sounds. An L2 sound perceived as similar or equivalent to an L1 sound, i.e., as an allophonic variant of it, is merely approximated (Flege, 1995). Flege (1987) reports data supporting the hypothesis that English NSs produced the new French [y] more accurately than the similar French [u]. Leather and James (1996) also suggest that L2 learners are more successful when they must construct a category from scratch than when they vaguely approximate an equivalent sound. There is anecdotal evidence that non-phonemic contrasts such as [p]/[pʰ] and [t]/[tʰ] are traditionally ignored by both English NSs learning French and French NSs learning English. Models such as Best's (1995) Perceptual Assimilation Model (PAM) and Kuhl and Iverson's (1995) Perceptual Magnet Effect (ME) model likewise aim to predict the difficulty of an L2 unit on the basis of perceptual discrimination, proximity, attraction, or distance between L1 and L2 magnets (phonetic spaces).

However, other studies report conflicting results. Bohn (1995) notes that non-native speakers select certain types of acoustic cues regardless of their L1 experience, which points to an L1-independent perception device.

In addition, there are other problems. First, accurate L2 perception does not automatically entail accurate L2 output (Neufeld, 1987). Consequently, there is no a priori reason to view perception and production as hard-wired together in the speech system (Leather & James, 1996; Strange, 1995). Second, the question remains of how much an L2 sound must differ from an L1 sound to be labeled new, similar, or equivalent, once idiosyncratic, stylistic, and dialectal variation is taken into account.

The complexity of the perception-production relationship in prosody has been underscored (Odlin, 1993). Prosodic features are among the most important typological distinctions between languages; they carry phonemic, syntactic, semantic, and discourse functions, and they play a bootstrapping role in infants' L1 acquisition. In other words, perceiving prosodic variation in the native language helps the child, at a very early age, locate important syntactic constituents (Jusczyk et al., 1995). Paradoxically, sensitivity to, and monitoring of, L2 prosodic organization seem to be mastered last.

The  perceptual  classification  paradigm  is,  therefore,  not  sufficient  to 
account  for  L2  sound  processes  and  progress. 

The Effect of Formal Instruction and Technology on L2 Speech Acquisition

Interlanguage phonology is highly sensitive to changes in communicative interaction. Such changes are related to a variety of individual and external factors. Three of them are pertinent to the present research: the formal learning context, the impact of technology, and the learner as an active participant in the L2 learning process. These internal and societal constraints underlie learners' decisions either to facilitate or to restrict their L2 speech acquisition (Dickerson, 1976; Ellis, 1994; Gass & Varonis, 1994; Tarone, 1987).

Formal instruction in L2 speech is predicated on the principles of communicative language teaching (CLT). Among its tenets, this teaching philosophy favors a learner-centered over a teacher-centered environment and focuses on authentic material and multi-sensory modes to enhance the use of the L2 for meaning and function (Celce-Murcia et al., 1997; Ellis, 1994; Leather & James, 1996; Morley, 1993; Riggenbach & Lazaraton, 1991).

Until recently, the teaching of pronunciation seemed limited to the mimicry of isolated words, phrases, and sentences, relying heavily on the International Phonetic Alphabet (IPA). However, the development of the CLT philosophy has delineated the place and role of pronunciation within communicative language proficiency and, in so doing, has opened the way to alternatives suited to new methodologies and to modern technology. Specifically, CLT classes tend to strengthen the link between perception and production, using listening discrimination and identification exercises, and to devote more attention to the prosodic features of natural speech (speaking rate, intonation, stress patterns) (Casagrande & Casagrande, 1996; Leather & James, 1996).

The role of technology. One of the major problems in the classroom is the reduced input typical of formal instruction: “teacher talk,” limited exposure to natural input, and inconsistent feedback. As a remedy, language teachers agree that media can enhance language learning (Celce-Murcia et al., 1997). Practice moves new sounds from uncontrolled perception to conscious monitoring and then to automaticity, i.e., centrally pre-programmed memory (Leather & James, 1996). Modern technology, such as speech spectrographic devices and built-in voice analyzers, provides immediate auditory and visual biofeedback, modifies speech perception, and enhances selective attention and self-assessment (Morley, 1993; Ziolkowski et al., 1992). The split-screen display of programs such as Visipitch (Kay Elemetrics, New Jersey) allows learners to compare their output visually and aurally with a model and to analyze intonation patterns, utterance length, pauses, aspiration, and word and sentence stress. Visual information is thus used to interpret acoustic cues (Andersen-Hsieh, 1992; Celce-Murcia et al., 1997; Landhal & Ziolkowski, 1995; Molholt, 1998; Morley, 1993). In addition, this type of software can store data for further analysis.
The users of such technology, the learners, each bring a unique background: native language (L1); exposure to, motivation for, and attitude toward the L2; age of learning (AOL); cognitive level; learning strategies; perceptual acuity; etc. Among these variables, the maturational issue remains the subject of ongoing debate. Research shows that the ability to discriminate sounds that contrast in the L2 but not in the L1 decreases as AOL increases (Flege, 1995). However, other studies show that the mechanisms used in L1 learning remain intact and available for L2 learning regardless of the learner's age. Schachter (1996) further asserts that there is empirical evidence that non-exceptional adult L2 learners can attain near-native phonological competence. Concerning motivation, Dickerson (1987) pointed out that learners exhibit a natural propensity to acquire L2 speech, which feeds their decision to monitor L2 pronunciation consciously. Abundant anecdotal evidence supports this natural inclination: students in pronunciation classes consistently express their desire to improve their accent and ‘sound native’.

Consequently, the type of task assigned matters. Leather and James (1996) suggest that isolated minimal-pair activities produce different results from minimal pairs embedded in connected speech. Imitation tasks reduce the memory load and facilitate immediate accurate production, but this may not guarantee long-term use in spontaneous speech. Information-gap tasks and field assignments appear to help learners focus on pronunciation: since these tasks require efficient communication, attention to accurate L2 speech rises (Crookes & Chaudron, 1991; Gass & Varonis, 1994; Pica et al., 1991). Different tasks may require different styles. Styles fall along a continuum from casual (vernacular and uncontrolled) to formal (closely monitored). Usually, the more formal the task, the more closely the learner monitors L2 speech, e.g., reading vs. extemporaneous speech (Tarone, 1985). Style also affects speaking rate: the more casual the style, the faster the rate. It is generally assumed that NS listeners react more positively to accented pronunciation when it is produced at slower rates. However, Munro and Derwing (1998) claim that “slowing down may not help second language learners” (159). Speaking rate, voiced/voiceless contrasts, aspiration, and vowel reduction and/or deletion are all features of L2 speech that can be monitored with currently available technology.

In short, the current rationale behind formal instruction is that pronunciation is a functional component of communicative effectiveness and fluency, based on exemplars construed through selective attention and consciousness and on enriched, structured exposure (Flege, 1995; Rochet, 1995). Formal instruction, together with computer-driven tasks and meaningful biofeedback, answers learners' desire to acquire L2 speech, provides the means to simplify complex learning processes, speeds up the rate of acquisition, and improves the quality of the ultimate product. However, analysts also emphasize that scientifically controlled research on classroom SLA instruction, specifically studies on long-term retention and on the acquisition of L2 prosodic patterns, has received little attention (Cohen, 1994; Ellis, 1994; Hardy, 1993; Larsen-Freeman & Long, 1993; Long, 1996; Schmidt, 1990; Sharwood Smith, 1981).




Assessment  of  Foreign  Accented  Pronunciation 
The objectives pursued in assessing foreign-accented pronunciation are to (1) diagnose and label the speaker's production according to levels, or scales; (2) set protocols of practice; and (3) evaluate the outcomes of the practice. The assessment of foreign-accented pronunciation therefore lies at the confluence of three sets of parameters: the relationship between the speaker and the listener; the task elicitation, or speech sample; and the methodology used to evaluate speech in compliance with validity and reliability criteria. This section is organized as follows: a succinct delineation of the parameters involved in speech, followed by two methodologies used in speech assessment, perceptual and instrumental rating, and their respective uses in Second Language Acquisition (SLA).

Assessment  of  Speech:  The  Physics  of  Sounds 

Speech may be defined as a dynamic stream of connected sounds, broadly classified as vowels and consonants, that is linguistically organized and constrained. Sound corresponds to the vibratory movements of air molecules, which propagate as pressure waves from the speaker's lips through the air and create sensations in the listener's ears. Because it is produced by movements, speech obeys physical principles and laws. Of the successive phases of speech (neurolinguistic programming, neuromuscular, organic, aerodynamic, acoustic, neuroreceptive, and neurolinguistic identification; Catford, 1988; Kent & Read, 1992), the acoustic phase, i.e., the sound waves, constitutes the intersection between the speaker's intended encoded message and the listener's decoding process (Figure 2.1).
Figure 2.1 Illustration of the production and perception processes of speech
(Compiled from Baken, 1987; Borden & Harris, 1984; Denes & Pinson, 1993)


Sound waves are therefore central to the assessment of speech signals. The speech stimulus displays considerable physical fluctuation due to factors such as inter- and intra-speaker variability (idiolects or sociolects, emotional state), speaking rate, environmental events (noise, reverberation, speaker-listener distance), telephone interference, differences in voice quality, and speech disorders (Laver, 1991; Pisoni & Lively, 1995). In spite of all this, the human mind has developed into a "remarkable seeker and organizer of patterns. It receives a seemingly chaotic variety of sights, sounds and textures, searches for the common properties among them, makes associations, and sorts them into groups [...]" (Borden, 1984: 166). Speech judgments attempt to answer the following question: beyond the variation that speakers exhibit, how do we perceive and interpret speech so as to judge a degree of deviation as atypical, e.g., as a 'foreign accent'? The assessment of sound waves is limited here to two paths. The first, auditory perception, consists of the listener's decoding of the informational messages provided by the speaker. The second, instrumentation, involves acoustic, aerodynamic, and/or kinematic analysis to measure the variables involved in the supralaryngeal, laryngeal, and sublaryngeal systems of speech production, all of which are encoded in the sound waves.

Auditory-Perceptual  Assessment 

This subsection briefly reviews the theories, methods, advantages, and disadvantages of this approach, including its place and role in SLA.

Theories:  The  biophysical  characteristics  of  speech  perception 

When raters use the auditory-perceptual method to evaluate a speech signal, they typically compare the stimulus to an internal standard, or scale, which each of them constructs according to his or her experience (Kreiman et al., 1993). The auditory mechanism is highly sophisticated and remarkably effective at transforming acoustic energy into mechanical energy, through the impedance-matching device of the ossicles in the middle ear, then into hydraulic energy in the fluid of the inner ear, and finally into neural signals sent to the brain. Speech perception uses the auditory mechanism not only to detect but also to interpret signals loaded with subtle, complex, and robust acoustic information, in order to negotiate the meaning implied in the uttered message. Coarticulation is among the main processes that explain how the brain disambiguates speech signals (Borden & Harris, 1984; Kent & Read, 1992).
Coarticulation provides constant information about ongoing signals: articulators move simultaneously for different, adjacent phonemes. It is the key that makes "speech transmission rapid and efficient as a code" (Borden & Harris, 1984: 131) and transforms a string of juxtaposed segments or phonemes into a dynamic, continuously changing signal. The robustness of the speech signal enables the listener to decode sounds even under difficult, noisy conditions, such as at a cocktail party. Because the signal is redundant, not all cues are indispensable. Analysis by speech synthesis shows, for example, that the two lowest formant frequencies (F1 and F2) are generally sufficient for vowels to be perceived.

Perception of prosody deserves special mention. "Suprasegmentals serve essential functions in communication" (Kent & Read, 1992: 152); they bind segments together and modulate information above the stream of vowels and consonants. Suprasegmentals are defined as intonation, speaking rate, and stress (lexical, contrastive, affective, etc.). Stresses are perceived in terms of pitch fluctuations, duration, and loudness or intensity contrasts. The assignment of stress at the lexical and sentential levels sets the rhythm and helps disambiguate the signal. However, the salience of one cue over the others is controversial (Borden & Harris, 1984; Delattre, 1966; Denes & Pinson, 1993; Kent & Read, 1992).

Methodology:  Raters,  tasks,  and  scales 

Typically,  perceptual  rating  consists  of  the  interaction  between  the  sound 
waves  of  the  stimuli  and  the  listener's  reaction  to  them,  i.e.,  what  the  stimulus 
evokes  in  the  listener.  Rating  implies  that  "the  overall  impression  a listener 
receives  [...]  can  be  decomposed  into  several  perceptually  distinct  aspects 
corresponding  to  various  terms  [...].  It  is  assumed  that  individual  listeners  can 
focus  their  attention  on  these  aspects  of  the  stimuli,  and  can  make  the 
judgments  required.  Finally  and  crucially,  it  is  assumed  that  characteristics  of 
the  measurement  tool  remain  constant  across  listeners"  (Kreiman  & Gerratt, 
1998:  1598).  In  essence,  this  constitutes  the  paradigm  for  assessing  speech. 

No standard listener profile has been established, nor has the number of raters an experiment requires. Research by Kreiman et al. (1993) shows that the number of listeners involved in rating sessions may range from 1 to 461. Their level of expertise may vary from experts in the field, i.e., experienced speech pathologists and phoneticians, through graduate and undergraduate students, to naive adults and sometimes children. The listeners' training varies along a continuum from none, to basic or formal orientation, to extensive training over years of practice.

The speech stimuli (sustained phonation, syllable, word, or sentence utterances) can be presented live or audio-recorded. When recorded, they are ideally digitized, processed on computer for filtering, segmentation, and normalization for peak voltage, and randomized before being rated. The ratings are recorded on scales (Kreiman et al., 1993; Schiavetti, 1992).
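Peak normalization scales each digitized token so that its largest absolute sample reaches a common target, which reduces level differences between recordings before rating; randomization then keeps a fixed presentation order from biasing the raters. A minimal Python sketch, assuming each token is already available as a list of floating-point sample values; filtering and segmentation are omitted, and the function names and stimulus labels are hypothetical.

```python
import random
from typing import List, Optional, Sequence


def peak_normalize(samples: Sequence[float], target_peak: float = 1.0) -> List[float]:
    """Scale a digitized token so that its largest absolute sample equals target_peak."""
    peak = max((abs(s) for s in samples), default=0.0)
    if peak == 0.0:
        return list(samples)  # silent token: nothing to scale
    gain = target_peak / peak
    return [s * gain for s in samples]


def presentation_order(stimulus_ids: Sequence[str], seed: Optional[int] = None) -> List[str]:
    """Return a shuffled copy of the stimulus list; a fixed seed makes the order reproducible."""
    order = list(stimulus_ids)
    random.Random(seed).shuffle(order)
    return order


token = [0.25, -0.5, 0.125]  # invented sample values
print(peak_normalize(token))  # [0.5, -1.0, 0.25]
print(presentation_order(["S01", "S02", "S03", "S04"], seed=7))
```

Keeping the seed lets the same randomized order be reused across raters or regenerated later for checking.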

Advantages  and  disadvantages 

Auditory-perceptual ratings offer the advantage of being convenient, cost-effective, and robust; that is, the auditory system is able to decode speech signals in spite of noise and interference (Kent, 1996). However, many problems with auditory perception have been highlighted in the literature. Baken (1987), Kent (1996), and Kreiman et al. (1993, 1998) suggest that, although no technology can match the complexity of the human auditory and neurological systems in decoding speech, listening alone is insufficient. The main reason is that the auditory system is inherently configured to deal with the speech signal as a whole: "The ear is too easily fooled" (Baken, 1987: 2).

Essentially, the problem lies in a lack of reliability due to intra- and inter-rater variability in judgments. Numerous factors may influence judgments (Kreiman et al., 1993):

Listener factors   <->   Task factors
experience               scale resolution
sensitivity              context effects
bias                     samples
error

Raters disagree not only on the terminology they use to describe speech problems but also on the perceptual dimensions corresponding to a particular problem; terms remain rather subjective (Baken, 1987). To complicate matters further, it is generally acknowledged that phonetic alphabets such as the IPA are only tentative. The phonetic structure of no language is completely known, and languages are constantly evolving, so an absolute transcription of what is heard is an illusion (Ladefoged & Maddieson, 1996).

Another source of variability is a tendency to 'mishear' speech signals. As Weismer and Martin (1992) comment, the "... deficits are as much in the ear of the listener as they are in the mouth of the speaker" (68).

Kent (1996) attributes some of these inaccuracies to, for example, phonetic expectations triggered by predictability from context or by the listener's linguistic experience; to speaking rate, which is known to alter the phonetic realization of phonemes; or simply to the listener's fatigue and lapses of attention. The rater may also be influenced by the speaker's physical appearance or by familiarity with the speaker, in which case the rater unconsciously 'recalibrates' internal standards through simple habituation. Measures to compensate for this subjectivity include inserting anchor stimuli to calibrate and/or reset the listener's template (Kent, 1996).

The most crucial source of inter- and intra-rater variability is the methodology or procedure used. Judgments vary according to the nature of the stimulus (class of sounds: [s] vs. [r]), its presentation (audio recording vs. live conditions), and its severity. For example, Kreiman et al. (1993, 1998) noticed that raters usually agree at the endpoints, or extremes, of a scale. Conversely, "Interrater agreement levels were consistently poor in the midrange of the rating scales" (Kreiman & Gerratt, 1998: 1601). "Ironically, mean ratings in the middle of a scale serve primarily to indicate that listeners disagreed" (1605). Scale specificity, or resolution, also influences inter-rater agreement: "If the quality being rated is multidimensional in nature but is rated on a unidimensional scale, listeners may selectively focus on one dimension or another [...]" (Kreiman et al., 1993: 32). Along the same lines, Schiavetti (1992) points out that different scales yield different results and may be more appropriate for some tokens than for others.
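The observation that a mid-scale mean can hide disagreement is easy to see numerically: two stimuli can receive the same mean rating while listeners agree on one and split on the other. A minimal Python sketch with invented 7-point ratings (not data from Kreiman and colleagues or from this study); reporting a dispersion measure such as the standard deviation alongside each mean is one simple way to expose the difference.

```python
from statistics import mean, pstdev
from typing import Sequence, Tuple


def rating_summary(ratings: Sequence[int]) -> Tuple[float, float]:
    """Mean and population standard deviation of one stimulus's ratings."""
    return float(mean(ratings)), pstdev(ratings)


# Invented ratings of two stimuli by the same four listeners on a 7-point scale.
consensus = [4, 4, 4, 4]  # every listener chooses the midpoint
split = [1, 7, 1, 7]      # listeners divided between the two endpoints

print(rating_summary(consensus))  # (4.0, 0.0)
print(rating_summary(split))      # (4.0, 3.0)
```

Both stimuli would be reported as "4" if only means were kept, although the second reflects complete disagreement.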

Finally, concerning raters' linguistic background and level of experience (their qualifications), Kreiman et al. (1993) suggest that practitioners seem to develop a built-in auditory 'palette' that enables subtle judgments naive listeners cannot make. The perceptual strategies used by naive listeners and by expert, experienced listeners vary widely, supporting the belief that individual standards are inherently unstable, hence unreliable.

Auditory-perceptual  rating  in  Second  Language  Acquisition  (SLA) 

In SLA, there are two different sets of listeners: (1) native speakers (L1 NSs) assessing the production of non-native speakers (NNSs); and (2) non-native L2 speakers assessing the sounds of the target language (L2), e.g., French NSs assessing /y/-/u/ produced by English NSs and, conversely, English NSs perceiving the French /y/-/u/ contrast. In the present study, speech assessment is limited to NSs evaluating NNS production of the target language, French.

In a typical classroom, the assessment of L2 pronunciation by formal raters is holistic, that is, part of the speaker's overall level of oral proficiency (Bachman, 1990; Celce-Murcia et al., 1997; Cohen, 1994; Fradd & McGee, 1994). In such instances, Flege and Fletcher (1992) claim that foreign accent ratings are neither reliable nor valid.

On the one hand, the teacher's evaluation may be influenced by factors independent of pronunciation per se, such as grammatical inaccuracy, lexical level, pauses, length of utterances, and fluency (Cohen, 1994; Munro & Derwing, 1998). Furthermore, teachers may be influenced by their own linguistic experience; they may not be native speakers of the target language, which underscores the unreliability of the template against which the speech stimuli are judged.
On the other hand, SLA theory questions the validity of testing pronunciation at all. The rationale is that utterances singled out as testing material do not reflect the speaker's ability to perform: according to the variability hypothesis, i.e., intra-speaker variation, speakers never produce the same utterance identically twice. In other words, the same speech stimuli would not be repeated identically if the test were taken a second time (Ellis, 1994).

In addition, the tasks may range from mimicry, with or without a script, to
reading or spontaneous speech, each posing a different challenge for
the listener-teacher-rater. The speech stimuli may be assessed directly during
oral interviews between rater and speaker-learner, or from audio-recordings. In
both situations, extraneous variables, such as noise interference and equipment
quality, are traditionally not tightly controlled. Scales such as the Proficiency
Guidelines published by the American Council on the Teaching of Foreign
Languages (ACTFL) or the Cambridge Assessment of Spoken English (CASE)
have been judged inadequate. Both are “based on an untenable theory of
development of proficiency in a second language and on mistaken intuitions
about oral interaction rather than on empirical research” (Young, 1995: 13).

In SLA research, unlike in the classroom setting, analysts traditionally follow
the methodology described above: tightly controlled stimuli, equipment quality,
and point scales (see Appendix A). However, the literature notes that judgments
suffer from a "lack of an objective means for gauging the degree of perceived
cross-language phonetic distance" (Flege, 1995: 264). The formatting of the speech
stimuli also affects rating. Flege and Fletcher (1991) claim that the higher the
proportion of native stimuli in a set of sentences being evaluated, the more
'strongly accented' listeners judge the NNS utterances, and vice versa. In other
words, the listeners' internal standards shift and recalibrate unconsciously
according to the stimuli presented. Concerning the raters' qualifications, Flege
(1984) found that both phonetically trained listeners and unsophisticated listeners
are able to detect foreign accents, in reading tasks as well as in short samples of
spontaneous speech. In other words, listeners, regardless of their training in SLA,
develop detailed phonetic category prototypes against which they compare and
evaluate non-native speech sounds. However, specific recognition and
identification of the features that constitute a particular accent, e.g.,
German- vs. Dutch-accented English, require training, especially when longer
stretches of speech are under investigation. Cross-linguistic perception may be
influenced by factors such as L1-specific properties, rate of occurrence, speaking
rate, and context (Flege, 1995; Rochet, 1995).

Instrumental  Analyses  of  Speech  Signals 

Instrumentation consists of interpreting the speech signals numerically.
Such a translation requires a system that can detect phenomena such as
soundwaves, airflow, or pressure; convert them into an electrical signal; and
manipulate the signal in order to provide and display the information necessary
for investigation (Baken, 1987). Three types of instrumentation are available to
assess speech. Acoustic analyses focus on measuring parameters like
fundamental frequency, sound pressure level, and duration. Aerodynamic
analyses include the quantification of variation in airflow and pressure during
speech production. Kinematic analyses include measuring movements of
articulators during respiratory function. This study focuses only on acoustic
analysis.

Acoustic  analysis 

“Acoustic measures are meaningful primarily to the extent that they
correspond to what listeners hear” (Kreiman & Gerratt, 1998: 1598). Acoustics
investigates variations occurring in waveforms as the result of the continuously
interacting sub-laryngeal, laryngeal, and supra-laryngeal systems: no value is
ever absolute (Baken, 1987; Catford, 1988; Kent, 1992). There are three types
of sources to consider: quasi-periodic laryngeal voicing (the vibration of the
vocal folds), typical of vowels and voiced consonants; sustained aperiodic
energy, i.e., random turbulent noise created by a constriction, as in
fricatives; and a transient aperiodic source, e.g., the burst release of stop
consonants. Fundamental frequency (F0), formant frequencies, duration, and
intensity are the most frequently used measures in acoustic analyses. According
to the source-filter theory, acoustic analyses aim to measure both the sound
source and the filter, or transfer function (oral and nasal cavities), of the speech
system.

The vibration of the vocal folds is the source; its rate determines the
fundamental frequency (F0) of the soundwave and, hence, its harmonics, which are
integer multiples of the fundamental vibratory frequency. F0 is measured in
cycles per second, or Hertz (Hz). The vocal tract acts as the filter, or transfer
function. In other words, while travelling through the pharynx and the mouth, the
soundwave is shaped by the length, shape, and size of the resonating cavities of
the vocal tract and by the opening or closing of the velo-pharyngeal port, the gate
to the nasal cavity. The resonances resulting from the interaction between the
soundwave and the vocal tract are called formant frequencies; they correspond
to regions of maximum energy relative to adjacent regions. Anti-resonances, also
called anti-formants, are typically introduced by consonants (e.g., nasals) and
absorb energy at particular frequencies. Although the human ear can register
sounds ranging from 20 to 20,000 Hz, studies have shown that the first 5000 Hz
suffice for acoustic analyses of natural human speech. Of the first three formants,
F2 (especially its transitions) is the most important carrier of linguistic
information.
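To make the link between vocal tract dimensions and formant frequencies concrete, the sketch below uses the textbook idealization of the vocal tract as a uniform tube closed at the glottis and open at the lips, whose resonances fall at odd multiples of c/4L. The tube model, the speed of sound (35,000 cm/s), and the 17.5-cm tract length are illustrative assumptions, not measurements from this study; they simply show why about five formants fall within the first 5000 Hz.

```python
# Quarter-wavelength resonances of an idealized uniform tube (closed at the
# glottis, open at the lips), a textbook approximation of a neutral vowel.
SPEED_OF_SOUND_CM_S = 35_000.0  # assumed speed of sound in warm, moist air


def tube_formants(length_cm: float, count: int,
                  c: float = SPEED_OF_SOUND_CM_S) -> list[float]:
    """Return the first `count` resonances F_k = (2k - 1) * c / (4 * L), in Hz."""
    return [(2 * k - 1) * c / (4 * length_cm) for k in range(1, count + 1)]


# Assumed adult vocal tract length of 17.5 cm (illustrative, not measured here).
formants = tube_formants(17.5, 5)
print(formants)  # [500.0, 1500.0, 2500.0, 3500.0, 4500.0]
print(all(f < 5000 for f in formants))  # True: five resonances fit below 5 kHz
```

A shorter tube raises every resonance proportionally, which is one reason formant values differ between speakers.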

Vowels are measured according to formant patterns, among which the
first three are the most important. They correlate with articulatory events and
provide information about physiological parameters such as lip spreading,
rounding/protrusion, tongue height and degree of advancement-retraction inside
the oral cavity, incisor separation/mandible depression, and vowel
tenseness-laxness. For instance, the F1 value for stressed [a] is high (730-850 Hz)
as a result of wide incisor separation. However, in an English unstressed [a], F1
decreases as the vowel is neutralized to a schwa and the incisor separation
decreases. The relation between the formants is crucial in measuring
vowel quality. For example, French [y] and [u] have very similar F1 values:
240-466 Hz and 240-398 Hz, respectively. However, they differ greatly in their F2
values: 1675-2750 Hz and 650-1387 Hz, respectively. While F1 remains roughly
constant, the F2 - F1 acoustic space, i.e., the difference between the two
values, defines the contrast between the two vowels (Argod-Dutard, 1996;
Delattre, 1966). In addition to formant values, measures of stops also provide
information about the preceding vowels: e.g., the longer the closure duration
(voiceless stops), the shorter the preceding vowel, as the [æ] in [hæt] is
shorter than the [æ] in [hæd] (Catford, 1988; Denes & Pinson, 1993; Kent &
Read, 1992).
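The F2 - F1 argument can be checked with simple arithmetic on the ranges cited above. The sketch below takes the midpoint of each range as a stand-in for a typical value (an illustrative simplification, not how the cited authors report their data) and shows that the F1 midpoints of [y] and [u] differ by only 34 Hz, whereas their F2 - F1 distances differ by 1160 Hz.

```python
# F1 and F2 ranges (Hz) for French [y] and [u], as cited in the text
# (Argod-Dutard, 1996; Delattre, 1966).
RANGES = {
    "y": {"F1": (240, 466), "F2": (1675, 2750)},
    "u": {"F1": (240, 398), "F2": (650, 1387)},
}


def midpoint(bounds: tuple[int, int]) -> float:
    """Centre of a (low, high) range; a crude stand-in for a typical value."""
    low, high = bounds
    return (low + high) / 2


def f2_minus_f1(vowel: str) -> float:
    """F2 - F1 distance (Hz) computed from the range midpoints."""
    r = RANGES[vowel]
    return midpoint(r["F2"]) - midpoint(r["F1"])


for v in ("y", "u"):
    print(v, midpoint(RANGES[v]["F1"]), f2_minus_f1(v))
# y 353.0 1859.5
# u 319.0 699.5
```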

Consonants are classified according to their place of articulation (labial,
dental, alveolar, velar, etc.), manner of production (plosive, nasal, fricative, etc.),
and voicing, e.g., voiced /bdg/ vs. voiceless /ptk/. Sets of acoustic cues are
available to measure consonants. For instance, stop or plosive consonants can
be measured using the following acoustic cues: (1) closure duration; (2) release
burst, i.e., the transient energy marking the end of the constriction; (3) "voice
bar," i.e., a low-frequency band of energy reflecting vocal fold vibration during
the closure; (4) Voice Onset Time (VOT), i.e., the interval between the release
spike and the onset of voicing of the adjacent vowel; (5) F2-F3 transitions (extent
and rate); and (6) spectral features. Among these cues, VOT is most commonly
used to measure aspiration in initial [p,t,k].
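As a minimal sketch, VOT reduces to the difference between two hand-placed time points: the release burst and the first glottal pulse of the following vowel. Assuming the 10-kHz sampling rate used later in this study, one sample equals 0.1 ms; the sample indices below are hypothetical.

```python
def vot_ms(burst_sample: int, voicing_onset_sample: int, fs_hz: int) -> float:
    """Voice Onset Time in milliseconds from two hand-placed sample indices.

    Positive: voicing begins after the release burst (short- or long-lag stops).
    Negative: voicing begins before the burst (prevoicing).
    """
    return (voicing_onset_sample - burst_sample) * 1000.0 / fs_hz


FS = 10_000  # sampling rate, Hz
# Hypothetical token: burst marked at sample 2500, first glottal pulse at 2960.
print(vot_ms(2500, 2960, FS))  # 46.0
# Hypothetical prevoiced token: voicing starts 100 samples before the burst.
print(vot_ms(2500, 2400, FS))  # -10.0
```

The accuracy of the result is bounded by how precisely the analyst places the two markers, not by the arithmetic.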

To measure glides [j,w], the duration of F2 transitions is commonly used;
for example, the F2 transition measured in [ui] is longer than that in [wi]
(Kent & Read, 1992).
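The duration of an F2 transition depends on where the transition is judged to end. The sketch below assumes an F2 track has already been extracted at fixed frame steps and takes the transition to end when F2 first comes within a tolerance of its steady-state value; the 50-Hz tolerance, the three-frame steady state, and the track values are hypothetical choices for illustration, not the criteria of Kent and Read (1992).

```python
def f2_transition_ms(f2_track_hz: list[float], frame_step_ms: float,
                     tolerance_hz: float = 50.0, steady_frames: int = 3) -> float:
    """Time from the first frame until F2 first comes within `tolerance_hz`
    of its steady-state value (mean of the last `steady_frames` frames).

    Assumes a monotonic rise or fall, as in a glide-to-vowel transition.
    """
    steady = sum(f2_track_hz[-steady_frames:]) / steady_frames
    for i, f2 in enumerate(f2_track_hz):
        if abs(f2 - steady) <= tolerance_hz:
            return i * frame_step_ms
    return len(f2_track_hz) * frame_step_ms


# Hypothetical F2 track for a [wi]-like token, one value every 5 ms.
track = [800, 1100, 1500, 1900, 2200, 2280, 2300, 2300, 2300]
print(f2_transition_ms(track, 5.0))  # 25.0
```

Changing the tolerance changes the measured duration, so the same criterion must be applied to every token being compared.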

Intensity, measured in decibels (dB), is the physical parameter
corresponding to loudness, although the relationship between the two is complex.
Speech intensity usually ranges around 60-70 dB. Intensity measures
characterize not only tense vs. lax vowels and physiological behavior, but also
stress. Because of idiosyncratic variability, intensity measurement requires strict
calibration to ensure reliability (Baken, 1987; Denes &
Pinson, 1993).
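Because the decibel scale is logarithmic, small differences in dB correspond to large differences in pressure, which is one reason calibration matters. The sketch below assumes levels are expressed as sound pressure level (dB SPL) relative to the standard 20-µPa reference; it shows that 60 dB SPL corresponds to 0.02 Pa and that the 10-dB span from 60 to 70 dB is a pressure ratio of about 3.16.

```python
import math

P_REF_PA = 20e-6  # standard reference pressure for dB SPL (20 micropascals)


def spl_db(pressure_pa: float) -> float:
    """Sound pressure level in dB re 20 uPa for an RMS pressure in pascals."""
    return 20.0 * math.log10(pressure_pa / P_REF_PA)


def pressure_pa(level_db: float) -> float:
    """Inverse of spl_db: RMS pressure (Pa) for a given level in dB SPL."""
    return P_REF_PA * 10 ** (level_db / 20.0)


print(round(spl_db(0.02), 6))                       # 60.0
print(round(pressure_pa(70) / pressure_pa(60), 3))  # 3.162
print(round(spl_db(0.04) - spl_db(0.02), 2))        # 6.02: doubling pressure adds ~6 dB
```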

Duration, used to measure speaking rate, stress, and vowel tenseness, is
considered a very reliable and consistent measure (Argod-Dutard, 1996; Delattre,
1966). However, Kent and Read (1992) caution against taking “duration
measurements at face value” (63) because of potential instrumental difficulties,
the main one being deciding exactly where a segment begins and ends.
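The boundary problem Kent and Read describe can be quantified directly. The sketch below converts hand-placed boundary sample indices into a duration at an assumed 10-kHz sampling rate and shows how a hypothetical 5-ms uncertainty at each boundary turns a 120-ms vowel into anything from 110 to 130 ms.

```python
def duration_ms(start_sample: int, end_sample: int, fs_hz: int) -> float:
    """Segment duration in milliseconds from boundary sample indices."""
    return (end_sample - start_sample) * 1000.0 / fs_hz


def duration_range_ms(start_sample: int, end_sample: int, fs_hz: int,
                      boundary_error: int) -> tuple[float, float]:
    """Shortest and longest durations if each boundary may be off by
    `boundary_error` samples in either direction."""
    shortest = duration_ms(start_sample + boundary_error,
                           end_sample - boundary_error, fs_hz)
    longest = duration_ms(start_sample - boundary_error,
                          end_sample + boundary_error, fs_hz)
    return shortest, longest


FS = 10_000
# Hypothetical vowel segmented from sample 3000 to sample 4200.
print(duration_ms(3000, 4200, FS))            # 120.0
# If each boundary is uncertain by 5 ms (50 samples):
print(duration_range_ms(3000, 4200, FS, 50))  # (110.0, 130.0)
```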

Laboratory acoustic analyses

Modern technology processes speech signals by digitization, and
computer software facilitates the storage of data and their analysis via
quantitative algorithms. A brief review of the acoustic analyses most commonly
used in cross-linguistic studies follows. The waveform display allows for
segmentation and editing of the signals. The spectrogram (SPG) displays
frequency against time, with intensity shown by the darkness of the trace, and can
be analyzed simultaneously with the waveform; formant displays (FMT) can also be
overlaid. Two different filter bandwidths are used depending on the objective of the
analyses. The narrowband filter (45 Hz) provides clear information about F0
variations, and the wideband filter (300 Hz) provides information about formant
frequency variations and transitions. Spectra display intensity as a function of
frequency. Spectral analyses refer mainly to fast Fourier transform (FFT) and
linear predictive coding (LPC) spectra. The FFT spectrum visually represents F0
and its harmonics together with the peaks and valleys corresponding to
resonances. LPC analyses display smoothed spectral envelopes well suited to
measuring resonance frequencies (Kent & Read, 1992).
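The following sketch illustrates what an FFT spectrum of a voiced frame shows: discrete lines at F0 and its integer multiples. It uses NumPy and a synthetic signal built from ten harmonics of a hypothetical 120-Hz fundamental with 1/k amplitudes; this stands in for real speech only loosely and includes no vocal tract filtering, so no formant peaks appear.

```python
import numpy as np

FS = 10_000   # sampling rate (Hz), as used for digitization in this study
F0 = 120.0    # hypothetical fundamental frequency (Hz)
N = 1_000     # 100-ms frame, giving 10-Hz spacing between FFT bins

t = np.arange(N) / FS
# Crude synthetic "voiced source": 10 harmonics of F0 with amplitudes 1/k.
# This is an illustrative assumption, not a model of real glottal flow.
frame = sum(np.sin(2 * np.pi * k * F0 * t) / k for k in range(1, 11))

# Window the frame to limit spectral leakage, then take the FFT magnitude.
spectrum = np.abs(np.fft.rfft(frame * np.hanning(N)))
freqs = np.fft.rfftfreq(N, d=1 / FS)

peak_hz = freqs[np.argmax(spectrum)]
print(peak_hz)  # strongest line at about 120 Hz (the fundamental)
# Energy sits at multiples of F0 (harmonics), not between them:
print(spectrum[24] > spectrum[18])  # 240-Hz harmonic vs. the 180-Hz gap: True
```

An LPC analysis of the same frame would instead fit a smooth all-pole envelope; with no filter applied here, that envelope would carry no formant information.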

Advantages  and  disadvantages 

Baken (1987: 2) states that "observation and measurement of speech [...]
offer significant advantages over unaided perceptual judgments," e.g., more
precise diagnosis and more appropriate identification and documentation of
treatment. In addition, technology now makes measurement methodology and
training cost-effective (Kent, 1996; Kent & Read, 1992). However, blind reliance
on acoustic measurements should be avoided. Baken (1987: 315) cautions that
"a given acoustic result can usually be produced by different combinations of
vocal actions." Similarly, Kent (1993: 110) advises that "no single method
satisfies every purpose." Values, therefore, are not absolute: although technology
provides new opportunities, instrumentation is not a panacea.

The main problem is that spectrograms are difficult to 'read' (Borden &
Harris, 1984; Kent, 1992). Coarticulation provides a host of different cues, either
simultaneously or over time, i.e., every sound carries information about adjacent
ones. Basically, coarticulation is the reason why there is no acoustic alphabet:
there is no one-to-one relationship between acoustic segments and phonemes.
Consequently, while coarticulation is critical for auditory perception, it is also the
cause of difficulties in the segmentation and measurement of speech signals.
For example, the F2 frequencies of [u] will vary depending on the F2
frequency of the preceding consonant: [k]/[g] (F2 of 2350 Hz) and [t]/[d] (F2 of
1800 Hz) vs. [p]/[b] (F2 of 800 Hz) or [l] (F2 of 875 Hz). In other words,
coarticulation, together with syntactic structures and prosodic patterns, conspires
to render any acoustic analysis by spectrograms complex and precarious,
starting with segmentation.

Identification of vowels from spectral features may also be influenced by
factors such as the speaker's fundamental frequency (F0), phonetic context, stress,
and speaking rate (Baken, 1987). Harmonic spacing is a direct correlate of F0.
This becomes critical when analyzing female voices: because their harmonics are
more widely spaced, they sample the spectral envelope more sparsely, and the
contrast between peaks of maximum energy and valleys in the spectrum is
diluted. “As fundamental frequency increases, there is a corresponding increase
in the interval between harmonics of the laryngeal source spectrum. At some
harmonic spacing, it becomes difficult to discern the location of formants in the
spectrum” (Kent & Read, 1992: 156). For instance, if a man's F0 is 120 Hz, the
source spectrum will display lines at 120, 240, 360 Hz, etc. If a woman's F0 is
225 Hz, the source spectrum will display lines at 225, 450, 675 Hz, and so on.
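The effect of harmonic spacing can be illustrated with the two F0 values from the example above. The sketch lists the harmonics that fall within the F1 range cited earlier for French [y] (240-466 Hz) and checks, with a rough rule of thumb (an assumption, not a CSL specification), whether the 45-Hz and 300-Hz spectrogram filters would separate individual harmonics for each voice.

```python
def harmonics(f0_hz: float, count: int) -> list[float]:
    """First `count` harmonics: integer multiples of F0."""
    return [k * f0_hz for k in range(1, count + 1)]


def harmonics_in_band(f0_hz: float, low_hz: float, high_hz: float) -> list[float]:
    """Harmonics falling inside [low_hz, high_hz]."""
    return [h for h in harmonics(f0_hz, int(high_hz // f0_hz)) if h >= low_hz]


def resolves_harmonics(filter_bandwidth_hz: float, f0_hz: float) -> bool:
    """Rough rule: a spectrogram filter shows separate harmonics only when
    its bandwidth is narrower than the harmonic spacing (= F0)."""
    return filter_bandwidth_hz < f0_hz


print(harmonics(120, 3), harmonics(225, 3))  # [120, 240, 360] [225, 450, 675]
# Harmonics available to "sample" the F1 region of French [y] (240-466 Hz):
print(harmonics_in_band(120, 240, 466))  # [240, 360]
print(harmonics_in_band(225, 240, 466))  # [450]
# Narrowband (45 Hz) vs. wideband (300 Hz) filters for both voices:
print([resolves_harmonics(bw, f0) for bw in (45, 300) for f0 in (120, 225)])
# [True, True, False, False]
```

With only one harmonic in the F1 region, the female spectrum gives the analyst a single point from which to infer the location of the F1 peak.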

Tasks may also affect the measures. Sapienza and Stathopoulos (1995)
observed statistical differences across three tasks when using both acoustic and
aerodynamic analyses to investigate speech: sustained phonation of the vowel [a],
repetition, and reading of connected speech with frequent occurrences of [a] in a
CVCV context ([papa]).
Next, equipment and systems also constitute a problem. The "[...]
adequacy of the measuring system's frequency response" (Baken, 1987: 257) is
essential. The type of recording system may influence the fidelity and
accuracy of the signals: digital audiotape (DAT) vs. direct-to-computer sampling,
microphone specifications such as directionality (uni- vs. omnidirectional) and
distance from the speaker's lips, and consistency in the input volume and in the
testing environment. These factors are important in perceptual judgments as
well, but even more so in acoustic testing, since data are translated into
numbers. If the instruments (sound level meter, rotameter, U-tube manometer,
etc.) are not properly calibrated, all results may be jeopardized (Baken, 1987).

Last but not least, the analyst is a potential source of error: understanding how
the equipment works; selecting suitable segments, parameter settings, and
acoustic analyses; and interpreting values correctly all require knowledge of
both acoustic theory and the algorithms used for the analyses. For example, in
spectrum analysis, the analyst must bear in mind that "the harmonic frequencies
are determined by the frequency of the vibrations of the vocal folds, whereas the
resonance frequencies are determined by the shape and size of the cavities
above the larynx" (Catford, 1988: 160). This is critical when setting the
parameters for LPC analyses of male vs. female voices. For example, the default
LPC and pitch extraction settings provided by the Kay Computerized Speech
Lab (CSL) are based on a male-voice normative baseline. The analyst must,
therefore, adjust the settings when female speech samples are analyzed. In
addition, however sophisticated the statistics, some results may not be
significant; in other words, the numbers are only as meaningful as the planning
of the analysis that produced them. Finally, the literature warns against the use
of normative databases in which, until recently, men's voice samples largely
outnumbered those of women and children.
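A widely used heuristic for choosing the LPC order is to allow two poles per formant expected below the Nyquist frequency, plus about two extra poles. The sketch below combines that heuristic with the uniform-tube approximation to show why a shorter (on average, female) vocal tract, with more widely spaced formants, calls for different settings. The tract lengths and the heuristic are assumptions for illustration; they are not the CSL defaults, which are not specified here.

```python
SPEED_OF_SOUND_CM_S = 35_000.0  # assumed speed of sound in the vocal tract


def formants_below(limit_hz: float, tract_length_cm: float) -> int:
    """Count uniform-tube resonances (2k - 1) * c / (4L) lying below limit_hz."""
    base = SPEED_OF_SOUND_CM_S / (4 * tract_length_cm)
    count = 0
    while (2 * count + 1) * base < limit_hz:
        count += 1
    return count


def lpc_order(fs_hz: float, tract_length_cm: float, extra_poles: int = 2) -> int:
    """Rule-of-thumb LPC order: two poles per expected formant below the
    Nyquist frequency, plus a few extra poles for source/radiation shaping."""
    return 2 * formants_below(fs_hz / 2, tract_length_cm) + extra_poles


# Illustrative tract lengths (assumptions): 17.5 cm (adult male), 14.5 cm (adult female).
print(lpc_order(10_000, 17.5))  # 12
print(lpc_order(10_000, 14.5))  # 10
```

Using the male setting on a female voice would ask the model for more resonances than the signal contains, which can produce spurious formant estimates.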

Instrumental  assessment  in  cross-language  studies 

Appendix A displays some acoustic studies that have been performed in
the SLA field over the past several decades. The acoustic cues most commonly
selected in cross-linguistic comparisons between English and the Romance
languages are voice onset time (VOT), F1-F3 frequencies, vowel duration,
intervocalic consonant duration, and the voiced/voiceless contrast. For instance,
Flege (1984) measured the VOT of [t] in /tu/ at 46 milliseconds for French NSs vs.
78 milliseconds for American speakers. The literature highlights the efficiency of
instrumental ratings: they facilitate the use and manipulation of large amounts of
data to identify and classify areas of pronunciation problems (Argod-Dutard,
1996; Celce-Murcia et al., 1997; Delattre, 1965-66; Flege, 1980-95; Molholt,
1998; Rochet, 1995).

In prosody, the acoustic correlates of basic prosodic phenomena are
fundamental frequency, intensity, and duration, which also influence one
another. “Syllable stress, or prominence, in English is signaled
mainly by amplitude. [...] However, duration is also a factor. [...] In fact, duration
is actually a more consistent cue to stress than amplitude” (Kent & Read, 1992:
65). Delattre (1966) and Vaissière (1991) have highlighted the predominant
effect of duration over pitch and intensity variation in French prosody. However,
no acoustic measures of progress in the acquisition of French prosodic
patterns by American speakers in the classroom setting were available.

Moreover, Beddor and Gottfried (1995) and Flege (1995) underscore that
most studies use different methodologies, rendering any generalization
precarious. In addition, although many studies have focused on the contrastive
F1 and F2 of French [y]/[u] and on the VOT of initial stops (Flege, 1987; Flege &
Hillenbrand, 1984; Rochet, 1995), no cross-linguistic acoustic analysis has been
performed on the production of contrastive [ɥ]/[w]. Similarly, no literature has
been found documenting acoustic research in the formal setting, be it at the
segmental or prosodic level; nor has any literature been found on a comparative
study between acoustic measures and perceptual ratings in the foreign language
classroom. In other words, most experiments performed for research purposes
are not necessarily intended for pedagogical applications, which isolates
the SLA researcher from the SLA practitioner.

In summary, the literature shows that many researchers, clinicians,
instructors, and linguists acknowledge the power and value of auditory-perceptual
judgment, but also consider it necessary to improve the reliability of
auditory judgments by combining them with acoustics. Kent (1996: 18) suggests
that an inventory of “perceptual-acoustic correlates would be a helpful guide in
establishing perceptual-acoustic validation.” He synthesizes the
acoustic-perceptual relation by placing it between two opposite poles: “At one
pole is the many-to-one relation, in which several acoustic variables are
associated with a single perceptual attribute [...]. At the other pole is a
one-to-one (or few-to-one) relation, in which a perceptual decision can be
mapped against a single acoustic dimension” (Kent, 1996: 19). SLA specialists
advocate more research on the quantitative specification of the acoustic
properties of phones. They underscore the need for a detailed
acoustic/articulatory analysis of L2 and L1 segments in various contexts in order
to construct the basis for categorization formats, i.e., to measure the 'phonetic
space' between L2 phones and the L1 phonetic map (Flege, 1995; Kuhl, 1995;
Leather & James, 1996; Rochet, 1995). Flege and Hillenbrand (1984: 199) state
that "... sound of a foreign language can be objectively assessed in a variety of
ways: (1) through the use of rating scales judgments by native speakers of the
target language, (2) by calculating the frequency with which L2 phones are
correctly identified, and (3) through acoustic analyses." It appears, therefore,
that combining perceptual and acoustic methods is an endorsed approach to
assist researchers and practitioners in their attempt to evaluate
foreign-accented speech.

Summary 

The present review of the literature indicates that research on the
assessment of speech, and more specifically of the foreign-accented pronunciation
of adults, has generated a number of studies over the last decades, starting with
the seminal works of Rousselot (1902) and Delattre (1949-66). However, many
specific areas still need to be investigated.

Few studies have targeted either female adult learners or the formal
classroom setting. Longitudinal studies assessing long-term retention in formal
instruction are sparse. Scientifically controlled empirical analyses of how
technology affects learning in the classroom are few, and statistical evidence
of a correlation between auditory-perceptual and acoustic measurement of
speech in the classroom has not yet been demonstrated (Ellis, 1994; Flege,
1995; Leather & James, 1996; Celce-Murcia et al., 1997).

As a consequence, the purpose of this study is to investigate the
foreign-accented speech of young American women learning French from two
perspectives: auditory-perceptual and acoustic. It aims to answer the
following experimental questions, repeated here for convenience:


• Can  improvement  in  pronunciation  over  formal  training-time  be  acoustically 
quantified  relative  to  the  native  (NS)  target  baseline? 

• Can  the  effect  of  technology  (Visipitch  II)  in  formal  pronunciation  training  be 
acoustically  quantified? 

• Can  long-term  retention  in  foreign  pronunciation  be  quantified? 

• Do the selected acoustic cues (F1 and F2 of [y]/[u], F2 transition in [wi] and
[ɥi], total duration of utterance, voice onset time of initial [t] and [p], syllable
duration, fundamental frequency variation, and F1 of unstressed [a])
correlate with auditory-perceptual ratings?


CHAPTER  3 

RESEARCH  METHODOLOGY 


The purpose of this study is to compare the results of two methods of
assessing improvement of pronunciation in the speech of young adult American
women learning French in the classroom setting. The first method consisted of
acoustic measurements, while the second consisted of auditory-perceptual ratings
of the same selected speech samples. The study was designed to investigate (1)
the outcome of formal training in French pronunciation under three sets of
treatment circumstances: traditional audio-tapes, audio-visual software (Visipitch II,
Kay Elemetrics), and long-term retention; and (2) the correlation between acoustic
measures and perceptual ratings.
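Question (2) amounts to measuring the association between two sets of scores. As one possible sketch, assuming Pearson's product-moment correlation (the statistical procedure actually used in this study is not specified in this section), the function below correlates an acoustic measure with perceptual ratings. The five speakers, the "distance from the NS mean" measure, and the 1-7 rating scale are hypothetical.

```python
import math


def pearson_r(xs: list[float], ys: list[float]) -> float:
    """Pearson product-moment correlation coefficient between two samples."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length samples of at least 2 values")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical data for five speakers (not results from this study):
# distance (Hz) of each speaker's F2 - F1 for [y] from the NS mean,
# and a perceptual accent rating (1 = native-like, 7 = strong accent).
acoustic_distance = [120.0, 340.0, 560.0, 800.0, 950.0]
accent_rating = [1.5, 2.5, 4.0, 5.5, 6.0]
print(round(pearson_r(acoustic_distance, accent_rating), 3))
```

A positive r here would mean that speakers acoustically farther from the native baseline also tend to be rated as more strongly accented.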

The  dependent  variables  are  both  the  auditory-perceptual  ratings  and  the 
acoustic  measures  applied  to  pre-  and  posttest  samples.  The  course  content,  the 
speakers’  profile,  and  the  consistency  in  data  collection  remain  constant. 

Data  Collection 

Speakers 

The speech samples were elicited from forty-five female subjects. To ensure
homogeneity among speakers, the requirements for both the native and the
non-native groups included:

1. Females between 18 and 45 years of age

2.  No  history  of  speech  or  hearing  disorder 

3.  Similar  academic  contexts 
There were fifteen native speakers of French (NSs) to provide the normative
baseline. The thirty non-native speakers of French (NNSs) were divided into two
groups: twenty American-English speakers learning French, and ten experienced
NNSs of French, all French Teaching Assistants, who formed the control group.
The twenty NNS learners of French were further divided into two groups: ten
students followed the visual biofeedback treatment using the Visipitch II (VP
group), and ten followed the traditional auditory feedback practice (non-Visipitch,
or NVP, group). The longitudinal group consisted of nine volunteer students, from
both the VP and NVP groups, drawn from the twenty NNS learners of French.

Treatment circumstances. The twenty NNS learners had all attained a similar
level of competence in their academic study of French, i.e., third-year
French. The nine NNSs in the longitudinal group, being enrolled in the French
program, were expected to pursue their studies of French beyond the pronunciation
course they had taken. This further learning was not viewed as an extraneous
variable, but rather as a normal component of progress in L2 speech.

The twenty speakers in the VP, NVP, and longitudinal groups attended two
sections of a sixteen-lesson course on Corrective French Phonetics, given by one
professor at the University of Florida in fall 1996. The students met twice a week
for a 50-minute period to work in the UF audio-language laboratory on prerecorded
materials with a printed manual. They were divided into two groups. In the first
group (NVP), the learners completed the course working primarily on auditory
feedback and tutoring from the instructor. The learners in the VP group followed
the same formal instruction. However, at the end of each lesson, they practiced five
typical sentences using the Visipitch II software (Kay Elemetrics), which provides
immediate auditory and visual biofeedback. This practice was limited to ten minutes
per lesson per student, i.e., 160 minutes per student over the course. The
experienced NNSs of French in the control group did not receive any instruction in
pronunciation.

Speech  Stimuli  - Elicitation  task 

All  speakers  were  asked  to  read  a simulated  real-life  dialogue  (Appendix  B). 
This  type  of  elicitation  task  was  preferred  over  spontaneous  running  speech  to 
guarantee  an  identical  linguistic  context  necessary  for  the  reliability  and  replicability 
of  the  study.  The  text  presented  no  particular  problem  as  to  lexical  or  grammatical 
unfamiliarity;  nor  was  there  any  orthographic  complexity,  i.e.,  sound/spelling 
contrasts. 

For the NNSs, the elicitation task was performed twice: once prior to
instruction and again upon completion of the course. The control group was asked
to take the same pre- and posttest with a time interval corresponding to a typical
semester course. The nine speakers of the longitudinal group were asked to
produce the same dialogue a third time during the fall of 1997. For thirteen of the
fifteen NSs, two productions were recorded from each speaker in order to account
for intra-speaker variability; the total number of NS samples was, therefore, 28.

The samples were recorded on audio-tape in a sound-proof room with
high-quality equipment (Sharp cassette-deck tape recorder and Shure unidirectional
dynamic microphone, Model SM48) provided by the Institute for Advanced Study of
the Communication Processes at the University of Florida (UF), Gainesville, USA.
Each recording session was preceded by a five-minute casual conversation
designed to lower the speaker's anxiety level and to enable the analyst to set the
input volume at the appropriate level and avoid overloaded signals.

Data  Preparation 

From the dialogue (see Appendix B), the following samples were extracted:

• segmentals:  tu  [ty]  'you'  vs.  tout  [tu]  'everything'; 

suite [sɥit] vs. oui [wi] 'yes';

• suprasegmentals: fais attention de ne pas glisser [fɛzatɑ̃sjɔ̃dœnœpaglise]

'be  careful  not  to  slip' 

The  speech  samples  were  selected  based  on  the  feasibility  of  the  present 
study  (time  and  scope),  the  predictability  of  speech  errors  based  on  phonemic  vs. 
allophonic  distribution/variations  and  prosodic  patterns  (See  Chapter  2). 

The digitization, segmentation, storage, and measurement of the stimuli were
performed using the Kay Computerized Speech Lab (CSL-Kay Audio Processing
Package, Model 4300). The selected tokens were captured and digitized at a 10-kHz
sampling rate. This rate automatically sets the low-pass filter to a cutoff frequency
of 4 kHz, which is considered the most appropriate level for typical speech, since
the sampling rate should be 2 to 2.5 times the highest expected frequency of the
signal to be measured (Kent, 1992; CSL Manual). Once digitized, the speech
samples were segmented, edited, and stored for further acoustic analyses.
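The sampling arithmetic above can be checked in a few lines. The sketch assumes only the values given in the text (10-kHz sampling, 4-kHz low-pass cutoff); the helper names are hypothetical.

```python
def nyquist_hz(fs_hz: float) -> float:
    """Highest frequency representable at sampling rate fs_hz."""
    return fs_hz / 2


def sampling_ratio(fs_hz: float, highest_freq_hz: float) -> float:
    """How many times the sampling rate exceeds the highest frequency of interest."""
    return fs_hz / highest_freq_hz


FS = 10_000       # sampling rate used in this study (Hz)
CUTOFF = 4_000    # anti-aliasing low-pass cutoff reported for that rate (Hz)

ratio = sampling_ratio(FS, CUTOFF)
print(nyquist_hz(FS))       # 5000.0
print(ratio)                # 2.5
print(2.0 <= ratio <= 2.5)  # True: within the 2x-2.5x guideline
```

Setting the cutoff below the 5-kHz Nyquist limit leaves a margin for the filter's roll-off, so little energy above 5 kHz can alias back into the analysis band.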

Acoustic  Analyses 

Consistent with cross-linguistic acoustic studies (See Appendix A), the
recommendations from the review of the literature, and the CSL Manual, 45 acoustic
parameters were tested on both the segments and the suprasegmentals as part of a
pilot study on reliability. Several measurements had to be discarded as
unreliable. For instance, intensity was not measured because of its complexity, which
has been emphasized in the literature (Baken, 1987; Kent & Read, 1992).


Although the input (volume) level was controlled for each speaker at pre- and
posttest, it was not feasible to control the intensity variation inherent in natural speech.
Similarly, nasality in [ɑ̃] and [ɔ̃], although there is anecdotal evidence
of its saliency as a cue to foreign accentedness, was not measured because of the
complexity of this particular issue in research (Baken, 1987; Cesar-Lee, 1995;
Kent & Read, 1992).

As  a result,  the  following  fifteen  measurements  were  selected. 

Acoustic Measures for the Quantification of Segments: [y]/[u] and [ɥi]/[wi]

The articulatory features of the respective segments, i.e., the shaping of the
vocal tract during production, translate into the following acoustic correlates. F1
frequency is inversely related to tongue height: the higher the tongue, the
lower the F1. Similarly, the smaller the incisor separation, i.e., the smaller the aperture
or the higher the mandible, the lower the F1. F2 frequency is directly related to
forward movement of the tongue: F2 frequency increases as the tongue
dorsum moves forward and, conversely, decreases when the tongue dorsum moves
backward. Lip rounding lowers all formant frequencies, but especially F2
(Delattre, 1966; Kent & Read, 1992). Catford (1988) also underscores the
difference between ‘endolabial’ and ‘exolabial’ rounding (See Chapter 2 under
“Articulatory Phonetics”). The dominant contrast between [y] and [u] is the front vs. back
position of the highest point of the tongue. As a result, the acoustic values selected
for [y] and [u] were the F1 and F2 frequencies, measured 20 milliseconds after
voice onset in order to avoid the effect of coarticulation with adjacent segments. The
American-English speakers learning French were expected to produce the French
[y] as an English [u], i.e., with lower F2 frequency values.
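
For readers who want to reproduce a formant reading outside the CSL, the following is a minimal sketch of linear predictive coding (LPC) formant estimation on a frame taken 20 ms after voice onset. It is an assumption, not the CSL's own algorithm: the function names, the 25-ms Hamming-windowed frame, the LPC order, and the 400-Hz bandwidth threshold are illustrative choices (an order of about 2 plus the sampling rate in kHz is a common rule of thumb). The synthetic "vowel" with known resonances only checks that the estimator recovers them; its values are hypothetical.

```python
import numpy as np


def lpc(frame: np.ndarray, order: int) -> np.ndarray:
    """Autocorrelation-method LPC: returns A(z) coefficients [1, -a1, ..., -ap]."""
    n = len(frame)
    r = np.correlate(frame, frame, mode="full")[n - 1 : n + order]  # lags 0..order
    # Solve the Toeplitz normal equations R a = r[1:] for the predictor a.
    R = np.array([[r[abs(i - j)] for j in range(order)] for i in range(order)])
    a = np.linalg.solve(R, r[1 : order + 1])
    return np.concatenate(([1.0], -a))


def formants_at(signal, fs, voice_onset_s, offset_s=0.020, frame_s=0.025,
                order=12, max_bw_hz=400.0):
    """Estimate formant frequencies (Hz) in a frame starting offset_s after voice onset."""
    start = int(round((voice_onset_s + offset_s) * fs))
    stop = start + int(round(frame_s * fs))
    frame = signal[start:stop] * np.hamming(stop - start)
    roots = np.roots(lpc(frame, order))
    roots = roots[np.imag(roots) > 0]           # keep one root per conjugate pair
    freqs = np.angle(roots) * fs / (2 * np.pi)  # pole angle -> frequency (Hz)
    bws = -np.log(np.abs(roots)) * fs / np.pi   # pole radius -> bandwidth (Hz)
    keep = (freqs > 90.0) & (bws < max_bw_hz)   # drop broad or very low poles
    return np.sort(freqs[keep])


def synth_vowel(formants_hz, fs, n, bw_hz=80.0, seed=0):
    """Noise-excited all-pole signal with known resonances, for checking the estimator."""
    den = np.array([1.0])
    for f in formants_hz:
        radius = np.exp(-np.pi * bw_hz / fs)
        theta = 2 * np.pi * f / fs
        den = np.convolve(den, [1.0, -2 * radius * np.cos(theta), radius ** 2])
    x = np.random.default_rng(seed).standard_normal(n)
    y = np.zeros(n)
    for i in range(n):  # recursion y[i] = x[i] - sum_k den[k] * y[i - k]
        y[i] = x[i] - sum(den[k] * y[i - k] for k in range(1, len(den)) if i - k >= 0)
    return y


fs = 10_000  # 10-kHz sampling rate, as reported above
sig = synth_vowel([300, 2000], fs, 3000)  # hypothetical [y]-like resonances
# A long frame and order 4 (two resonances) keep this synthetic check stable.
print(formants_at(sig, fs, voice_onset_s=0.05, frame_s=0.2, order=4))
```

On real recordings the default 25-ms frame and a higher order would be used, and F2 - F1 follows directly from the two lowest estimates.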

The acoustic measure selected for the diphthongs [ɥi] and [wi] was the F2
frequency transition from the onset of each glide. Traditionally, the F2 transition
duration is used to measure the contrast [ui] vs. [wi], i.e., Vowel+Vowel vs.
Semivowel+Vowel. The present research focused instead on two different semivowels
followed by the same high vowel. Since the major contrast between the two
diphthongs stems from their respective related vowels, palatal [y] vs. velar [u],
while the vowel [i] in the nucleus is constant for both, F2 frequency transition
was selected. The learners were expected to produce the semivowel [ɥ] as [w].

Parameters  Selected  for  the  Quantification  of  Prosody 

As described in the literature, the acoustic correlates of prosody are
fundamental frequency variation, intensity, and duration (Delattre, 1966; Kent &
Read, 1992; Vaissière, 1991). The pilot study suggested discarding measures
of intensity in the present context. Consequently, in the selected phrase fais
attention de ne pas glisser [fɛzatɑ̃sjɔ̃dənəpaglise] ‘be careful not to slip’,
duration, F0, and formant frequency variations were measured as acoustic
correlates to determine the saliency of cues to infelicitous prosody.

Total  duration  of  utterance  (TDU)  was  selected  to  investigate  progress  in 
speaking  rate.  The  NNSs  were  expected  to  decrease  their  TDU  between  pre- 
and  posttest. 

Voice Onset Time (VOT) for [p] vs. [pʰ] in pas and [t] vs. [tʰ] in attention
measured the duration between the burst release (spike) and the onset of voicing,
i.e., vocal fold vibration, of the adjacent vowel, as illustrated in Figure 3.1. The
literature shows that NNSs tend to produce these stops with
long-lag VOT, whereas French native speakers produce short-lag VOT
(Flege, 1987). NNS improvement between pre- and posttest was, therefore,
expected to translate into a shorter VOT.
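
The VOT measure, and the "closer to the NS target" criterion used to judge improvement, can be sketched as below. The helper names are hypothetical, the spectrogram landmarks are invented for illustration, and the second check reuses the VOT [p] group means reported in Table 4.1 (NS target 11 ms, pretest 31 ms, posttest 20 ms).

```python
def vot_ms(burst_s: float, voicing_onset_s: float) -> float:
    """VOT in ms: time from the burst release (spike) to the onset of voicing."""
    return (voicing_onset_s - burst_s) * 1000.0


def approaches_target(pre: float, post: float, target: float) -> bool:
    """True if the posttest value lies closer to the NS target than the pretest value."""
    return abs(post - target) < abs(pre - target)


# Hypothetical landmarks (s) read off a spectrogram for one token of "pas".
print(round(vot_ms(burst_s=0.500, voicing_onset_s=0.520), 1))  # 20.0
# VOT [p] group means: NS target 11 ms, pretest 31 ms, posttest 20 ms.
print(approaches_target(pre=31, post=20, target=11))  # True
```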


Figure  3.1 

Illustrations of spectrograms of (A) aspirated and (B) unaspirated stops. The <=>
indicates the interval measured between the ‘spike’ and voice onset.
(Reproduced from Kent & Read, 1992: 107).

Formant 1 of unstressed [a] in attention. Studies in French indicate that
[a], whether stressed or unstressed, has F1 values ranging from 691 Hz to 850 Hz
(Delattre, 1966; Argod-Dutard, 1996). English NSs are expected to neutralize
the unstressed [a], i.e., to alter its resonance frequencies from a full vowel to a
neutral schwa, therefore lowering the F1 values. NNS production at the pretest
was expected to be lower than NS output, with improvement translating into an
F1 [a] increase.

F0 variation and syllable duration as cues to stress patterns. Considering
that English is labeled as stress-timed, i.e., its rhythm is grounded on stress
assignment, while French is labeled as syllable-timed, i.e., its rhythm is
grounded on temporal organization, the present acoustic analyses investigated
both sets of measures. More specifically, unstressed [ta] in attention, stressed
[pa] in pas, and stressed [se] in glisser were examined in terms of duration and
fundamental frequency variation relative to their respective adjacent syllables.
The learners were expected to assign an infelicitous lexical stress on [ta] as well
as on [gli] preceding [se].
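
The text does not spell out how "relative to the adjacent syllables" was computed. The sketch below assumes one plausible definition: the target syllable's mean F0 minus the mean F0 of its neighbours, and its duration divided by the neighbours' mean duration. The `Syllable` class, the function name, and the example values are hypothetical.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class Syllable:
    label: str
    start_s: float       # syllable onset (s), from manual segmentation
    end_s: float         # syllable offset (s)
    f0_hz: list[float]   # F0 samples (Hz) from the voiced part of the syllable

    @property
    def duration_ms(self) -> float:
        return (self.end_s - self.start_s) * 1000.0

    @property
    def mean_f0(self) -> float:
        return mean(self.f0_hz)


def relative_to_neighbours(syllables: list[Syllable], i: int) -> tuple[float, float]:
    """Return (F0 difference in Hz, duration ratio) of syllable i vs. its adjacent syllables."""
    target = syllables[i]
    neighbours = [syllables[j] for j in (i - 1, i + 1) if 0 <= j < len(syllables)]
    f0_diff = target.mean_f0 - mean(s.mean_f0 for s in neighbours)
    dur_ratio = target.duration_ms / mean(s.duration_ms for s in neighbours)
    return f0_diff, dur_ratio


# Hypothetical values for glisser: [se] is phrase-final, so [gli] is its only neighbour.
glisser = [Syllable("gli", 1.20, 1.30, [230.0, 240.0]),
           Syllable("sser", 1.30, 1.50, [270.0, 290.0])]
print(relative_to_neighbours(glisser, 1))  # roughly (45.0, 2.0)
```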

A total of 15 measures were, therefore, considered as dependent acoustic
variables. They are summarized hereafter:

1. F1 of [y] in tu
2. F2 of [y] in tu
3. F1 of [u] in tout
4. F2 of [u] in tout
5. F2 frequency transition of [ɥi]
6. F2 frequency transition of [wi]
7. Total duration of utterance
8. F1 frequency of unstressed [a] in attention
9. VOT of unstressed [t] in attention
10. VOT of stressed [p] in pas
11. Syllable duration of unstressed syllable [ta] in attention
12. F0 variation of unstressed [ta] relative to the adjacent syllables fais a- and -tion in fais attention
13. Syllable duration of stressed [pa]
14. F0 variation of stressed syllable [pa] relative to the adjacent syllables de ne and gli- in de ne pas glisser
15. F0 variation of stressed [se] (-sser) relative to [gli] in glisser

Auditory-Perceptual Ratings

Listeners 

The listeners were six French NSs academically involved in the teaching
of French, although not specifically trained in the recognition of phonetic
features. Thus, although naive and unsophisticated as phoneticians, these raters
were trained in listening to and assessing L2 learners. None of them reported
any hearing disorder.

Listening  Material 

For each speech stimulus, there were 75 samples: 69 NNS samples and 6 NS samples
selected from the normative database. Each of the four segments and the
phrase was produced 69 times by the NNSs: 20 NNSs x 2 (pre-, posttests) + 10
NNSs Control x 2 (pre-, posttests) + 9 NNSs (longitudinal test). The 69 NNS
samples were randomized using a computer-generated system developed by
Jose Diaz for the University of Florida (with MATLAB as software) and
recorded in sequence from the digital system back to an analog one. One NS sample
was inserted after every 10 NNS samples. These prototypical NS samples were designed to
regularly reset the listener's template and compensate for possible ‘mishearing’
or habituation to foreign-accented utterances. The listeners were informed that
the set included both NNS and NS samples, but not at what rate.
Each set of recordings was introduced by clear instructions about the task and,
more importantly, by two exemplars of prototypical NS production, used as
anchor stimuli to set the auditory standards and calibrate the listener's auditory
scales. The listeners were informed that these two exemplars were NS samples.
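
The ordering rule (69 shuffled NNS samples with one NS sample after every tenth) can be sketched as follows. This is a hypothetical stand-in for the MATLAB system mentioned above, whose code is not described; the sample IDs are placeholders, and the two NS anchor exemplars that open each set are not modeled.

```python
import random


def build_listening_sequence(nns_ids, ns_ids, every=10, seed=None):
    """Shuffle the NNS samples and insert one NS sample after every `every` NNS samples."""
    rng = random.Random(seed)  # seeded generator so a given order can be reproduced
    shuffled = list(nns_ids)
    rng.shuffle(shuffled)
    ns_iter = iter(ns_ids)
    sequence = []
    for k, sample in enumerate(shuffled, start=1):
        sequence.append(sample)
        if k % every == 0:     # after the 10th, 20th, ... NNS sample
            ns = next(ns_iter, None)
            if ns is not None:
                sequence.append(ns)
    return sequence


nns = [f"NNS{i:02d}" for i in range(69)]  # placeholder IDs
ns = [f"NS{i}" for i in range(6)]
seq = build_listening_sequence(nns, ns, seed=1)
print(len(seq))  # 75
```

With 69 NNS samples, the insertion point is reached six times (after samples 10 through 60), which accounts for the 6 NS samples and the total of 75 per stimulus.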

The first set of speech samples that the listeners had to rate was the
phrase, followed by the segments. This sequence was chosen in order to present
the task to the listeners in order of difficulty, from less to more
difficult, i.e., (1) fais attention de ne pas glisser [fɛzatɑ̃sjɔ̃dənəpaglise] ‘be
careful not to slip’; (2) tu [ty] ‘you’; (3) tout [tu] ‘everything’; (4) oui [wi] ‘yes’; and
(5) suit [sɥi]. The rating of [ty]/[tu] was preferred over [y] and [u] because
listeners in the pilot study complained about the artificiality of rating [y], since it
is not a meaningful word in French. Similarly, suit
[sɥi] was preferred over suite [sɥit] because the release or absence of release of
the final consonant could have diverted the listeners' attention away from
the diphthongs. In other words, the syllable [sɥit] that the listeners had to rate
was truncated to [sɥi] because the final [t] was produced inconsistently.

Rating  Scales 

The listeners were provided with perceptual rating scales of different
resolutions. For the phrase, listeners were asked to rate the degree of foreign
accentedness on a four-point scale: (1) no foreign accent, (2) mild, (3) moderate,
or (4) strong foreign accent. For the segments, listeners were asked to rate
foreign accentedness on a three-point scale: (1) no foreign accent, (2) mild, or
(3) strong foreign accent. The differentiation between the two scales was
the result of the pilot study. Listeners who tested the material and the protocol
considered that a 3-point scale for the phrase did not provide sufficient flexibility
of choice, whereas a 4-point rather than a 3-point scale for the segments was confusing.

Statistical Analyses

Descriptive statistics were computed to determine the
normative baseline (means and standard deviations) and the distribution
(central tendency and dispersion) of the pre- and posttest values
for the NNS utterances.

Each of the fifteen acoustic measures selected to quantify foreign-accented
pronunciation at the segmental and suprasegmental levels was
logged 97 times: 28 NS samples (See under “Elicitation Task”) + 20 NNSs x 2
(pre-, posttest) + 10 NNSs Control x 2 (pre-, posttest) + 9 NNSs Longitudinal.

The 15 variables multiplied by 97 stimuli resulted in a total of 1,455 measures to
be analyzed for statistical significance in conjunction with the results of
the perceptual part of the experiment.

For the auditory-perceptual ratings, the five sets (4 segments and 1
phrase) of 75 stimuli were judged by six listeners, yielding a total of 2,250
ratings. The discrepancy between the number of acoustic measures and the
number of auditory-perceptual ratings stems from the fact that the auditory test
relies on holistic judgment. In other words, for the whole phrase fais attention de
ne pas glisser, the number of perceptual ratings is 450: 75 samples of the
phrase (69 NNSs + 6 NSs) x 6 raters. However, the number of acoustic
values for this phrase is 873: 97 samples of the phrase (69 NNSs + 28 NSs)
x 9 acoustic parameters. Conversely, to assess [y], there were 194 acoustic
values (F1 x 97 + F2 x 97) against 450 auditory-perceptual ratings.
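
The sample and rating counts above follow from a handful of multiplications. The short sketch below (the function name is hypothetical) recomputes them from the design figures given in the text, which makes the contrast between acoustic values and holistic ratings easy to verify.

```python
def tally() -> dict[str, int]:
    """Recompute the acoustic and perceptual counts from the design figures."""
    nns = 20 * 2 + 10 * 2 + 9         # experimental pre/post + control pre/post + longitudinal
    acoustic_samples = nns + 28       # + 28 NS samples for the acoustic baseline
    perceptual_samples = nns + 6      # + 6 NS samples inserted for the listeners
    raters = 6
    return {
        "nns_productions": nns,                                 # 69
        "acoustic_samples": acoustic_samples,                   # 97
        "acoustic_measures": 15 * acoustic_samples,             # 1,455
        "perceptual_ratings": 5 * perceptual_samples * raters,  # 2,250 (4 segments + 1 phrase)
        "phrase_ratings": perceptual_samples * raters,          # 450
        "phrase_acoustic_values": 9 * acoustic_samples,         # 873 (9 phrase parameters)
        "y_acoustic_values": 2 * acoustic_samples,              # 194 (F1 and F2)
    }


print(tally())
```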

First, the statistical significance of acoustic variations was assessed with both
a between-subjects and a within-subjects approach. The variations between
speakers’ groups (independent groups) were examined according to the type of
training or practice followed by the NNSs in each group: audio vs. audio-visual
(Visipitch II). The variations within subjects (matched pairs of same subjects), based
on pre- and posttests, were examined according to two temporal parameters: a typical
sixteen-week training period, as well as long-term retention, that is, when a second
posttest production was elicited three months after the end of the course.

The statistical analysis was performed using SAS software. The significance
level or alpha level was set at .05. In other words, the null hypothesis was rejected
only when p < .05; for any p-value ≥ .05, the null hypothesis cannot be rejected,
i.e., there is not enough evidence to support the alternative hypothesis.
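
The decision rule can be illustrated with a within-subjects comparison. The SAS procedures are not specified here, so the sketch below assumes a paired t-test on pre- and posttest values. For two time points, a one-way repeated-measures F equals t squared and yields the same p-value. To stay self-contained, the sketch uses only the Python standard library and integrates the t density numerically. Function names are hypothetical, and the data in the usage line are toy values.

```python
import math
from statistics import mean, stdev


def t_pdf(x: float, df: int) -> float:
    """Density of Student's t distribution with df degrees of freedom."""
    log_c = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
             - 0.5 * math.log(df * math.pi))
    return math.exp(log_c - (df + 1) / 2 * math.log1p(x * x / df))


def t_two_sided_p(t: float, df: int, steps: int = 2000) -> float:
    """Two-sided p = 1 - 2 * (area under the t density from 0 to |t|), via Simpson's rule."""
    b = abs(t)
    if b == 0.0:
        return 1.0
    h = b / steps  # steps must be even for Simpson's rule
    s = t_pdf(0.0, df) + t_pdf(b, df)
    for k in range(1, steps):
        s += (4 if k % 2 else 2) * t_pdf(k * h, df)
    return max(0.0, 1.0 - 2.0 * s * h / 3.0)


def paired_t_test(pre: list[float], post: list[float]) -> tuple[float, int, float]:
    """Matched-pairs t-test on the post - pre differences; returns (t, df, p)."""
    diffs = [b - a for a, b in zip(pre, post)]
    n = len(diffs)
    t = mean(diffs) / (stdev(diffs) / math.sqrt(n))
    return t, n - 1, t_two_sided_p(t, n - 1)


ALPHA = 0.05
t, df, p = paired_t_test([10, 12, 14, 16], [9, 10, 13, 13])  # toy pre/post values
print(f"t({df}) = {t:.3f}, p = {p:.4f}, reject H0: {p < ALPHA}")
```

Here H0 is rejected because p falls below .05; a p-value of, say, .075 would leave it standing.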

Then, the relationship between the acoustic measures and the auditory
ratings was examined. After determining whether or not there is an association (p < .05)
between each acoustic variable and the perception of degrees of foreign accent, the
strength of this association, i.e., the correlation, was measured. The Pearson
Correlation Coefficient is expected to demonstrate to what extent “...an observed
change in one variable appears to be associated with a concomitant change in
another” (Maxwell & Satake, 1997: 136). The stronger the correlation, the closer the
coefficient is to 1.00 in absolute value (positive or negative depending on whether the
linear relationship is positive or negative). The calculation of
the strength of this association was intended to make statistical inferences and
predictions as to the definition and the resolution of prototypical acoustic-phonetic
spaces per category or level of foreign-accented pronunciation.
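
A minimal sketch of the Pearson coefficient follows, written out from its definition: the sum of cross-products of deviations divided by the square root of the product of the sums of squares. Coding the accent categories as integers (1 = native, 3 = strong) is an assumption made here for illustration, and the token values are hypothetical.

```python
import math


def pearson_r(x: list[float], y: list[float]) -> float:
    """Pearson product-moment correlation between two equal-length samples."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two samples of equal length >= 2")
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical tokens: F2 - F1 of [y] (Hz) and consensus accent category (1 = native ... 3 = strong).
f2_f1 = [2050.0, 1980.0, 1800.0, 1750.0, 1450.0, 1300.0]
category = [1, 1, 2, 2, 3, 3]
print(round(pearson_r(f2_f1, category), 3))  # negative: F2 - F1 falls as accent rises
```

The sign of r gives the direction of the relationship and its absolute value the strength.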


CHAPTER  4 
RESULTS 


This study compared the results of two methods used for assessing
improvement of pronunciation in the speech of young adult American women
learning French in the classroom setting. The first method used acoustic
measurements, while the second method used auditory-perceptual ratings of
the same speech samples analyzed with the acoustic measures. The
speech samples included both segmental and suprasegmental features: tu ‘you’
vs. tout ‘everything’, oui ‘yes’ vs. suite, and fais attention de ne pas glisser
‘be careful not to slip’. All speech samples were audio-recordings of utterances
produced by forty-five female subjects, aged 18-42, at three dates approximately
three months apart (pretest, posttest, and longitudinal test). Fifteen native
speakers of French provided the baseline, twenty English-speaking women
learning French formed the experimental group, and ten experienced non-native
speakers of French formed the control group. The study was designed to
examine the following independent variable: formal training in French
pronunciation under three sets of treatment circumstances, i.e., traditional
audio-tapes, audio-visual software (Visipitch II, Kay Elemetrics), and long-term
retention. The dependent variables were the acoustic measures made from the
above-mentioned samples and the auditory-perceptual ratings.

Fifteen acoustic parameters were selected as potential cues to quantify
progress in foreign-accented pronunciation (See Chapter 3-Methodology).
However, in the course of the research two sets of measures were altered: F2
frequency variations at the onset of the semivowels [w] and [ɥ] were discarded, and
F1 and F2 frequencies for [y] and [u] were replaced by the
difference between the two formants, F2 - F1. The analyses of F2 frequency
variations for the semivowels [w] in oui ‘yes’ and [ɥ] in suite were inconclusive.
The lack of homogeneity between their respective linguistic environments
generated inconsistent acoustic measures: the broader linguistic contexts of
[wi] in oui and [ɥi] in tout de suite ‘right away’ were vastly different, which made such a
comparative analysis invalid. F2 frequency variation for the semivowels was,
therefore, discarded. F1 and F2 frequencies were replaced by their difference,
F2 - F1, for [y] and [u] because F2 - F1 provided more stable values than F1 and F2
analyzed separately. This difference reflects the interaction between lip
rounding and the anterior or posterior position of the tongue dorsum. Replacing the
four F1/F2 measures with two F2 - F1 differences and discarding the two semivowel
transitions reduced the fifteen variables to eleven; these eleven acoustic dependent
variables were eventually analyzed statistically.
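
The F2 - F1 substitution is a per-token subtraction. The text does not say how stability was judged. The sketch below assumes the coefficient of variation (sample standard deviation divided by the mean) as one unit-free yardstick for comparing the spread of F2 - F1 with that of F1 or F2 alone. The function names and the three (F1, F2) readings are hypothetical.

```python
from statistics import mean, stdev


def f2_minus_f1(tokens: list[tuple[float, float]]) -> list[float]:
    """Per-token F2 - F1 (Hz) from (F1, F2) pairs."""
    return [f2 - f1 for f1, f2 in tokens]


def coefficient_of_variation(values: list[float]) -> float:
    """Sample SD divided by the mean: a unit-free index of token-to-token spread."""
    return stdev(values) / mean(values)


# Hypothetical (F1, F2) readings for three tokens of [y] from one speaker.
tokens_y = [(280.0, 2050.0), (300.0, 2100.0), (290.0, 1990.0)]
diffs = f2_minus_f1(tokens_y)
print(diffs, round(coefficient_of_variation(diffs), 4))
```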

Six  French  native  listeners  provided  the  auditory-perceptual  ratings.  They 
were  asked  to  judge  the  segmental  speech  samples  according  to  a three-point 
scale  and  the  phrase  along  a four-point  scale  (See  Chapter  3-Methodology). 

The statistical significance of the difference between the means of the
various groups was tested. Both between-subjects and within-subjects differences
were analyzed statistically. The between-subjects (independent groups of
subjects) factor was the type of training: audio vs. audio-visual (Visipitch II). The
within-subjects factor (matched pairs of same subjects) was the pre-/posttest.
The pre- and posttest scores were examined according to two temporal parameters:
(1) at the end of the typical semester training period (posttest 1), and (2) three
months after the end of the course (posttest 2).

The  statistical  analysis  was  performed  using  SAS  software.  The  alpha 
level  was  set  at  .05.  A Pearson  correlation  analysis  was  also  completed  to 
determine  the  relationship  between  the  acoustic  variables  and  the  perception  of 
foreign  accent.  For  each  variable,  the  assumptions  of  normality  of  distribution 
and  of  equal  variances  were  verified,  so  parametric  tests  were  appropriate. 

The  results  of  this  study  are  presented  relative  to  the  research  questions. 
First,  can  improvement  in  pronunciation  with  formal  training  be  acoustically 
quantified  relative  to  the  native  (NS)  target  baseline?  Second,  can  the  effect  of 
technology  (Visipitch  II)  used  in  formal  pronunciation  training  be  acoustically 
quantified?  Third,  can  long-term  retention  in  foreign  pronunciation  be  quantified? 
And  fourth,  do  selected  acoustic  cues  correlate  with  auditory-perceptual  ratings? 
For each question, the data are presented first, followed by the results of the
statistical test.

The  Quantification  of  Formal  Training 
This part of the experiment was designed to demonstrate to what extent
the formal training and practice followed by 20 non-native speakers (N=20)
quantitatively translated into progress relative to the NS target. Table 4.1 shows
the means and statistical significance of pre- and posttest results of the NNS
group (matched pairs/within-subjects study). It was hypothesized that formal
training in pronunciation would generate improvement in the learners’
pronunciation (p < .05).

Table 4.1

Approximation by 20 NNSs to the NS Target as a Result of Formal Training: Pre-
and Posttest Means, Difference of Means, and p-Values for Eleven Acoustic
Variables.

Acoustic Parameter            15 NS    20 NNS    20 NNS    Difference    Pr>F
                              Target   Pretest   Posttest  (Post - Pre)
                                       Mean      Mean

Formants in Hertz (Hz)
F2-F1 [y]                     1721     1784      1813      29*           .689
F2-F1 [u]                     1218**   1846      1866      20*           .733
F1 [a]                        732      589       622       33            .343

Duration in milliseconds (msec.)
Total Duration of Utterance   1408     2245      1867      -378          .0012
VOT [t]                       22       44.3      41.2      -3.1          .343
VOT [p]                       11       31        20        -11           .0008
Unstressed ‘tten’             128      166       159       -7            .129
Stressed ‘pa’                 104      162       136       -26           .0146

Fundamental Frequency Variations: Unstressed vs. Stressed Syllables in Hertz (Hz)
Unstressed ‘tten’             258      248       243       -5            .442
Stressed ‘pa’                 254      244       237       -7*           .150
Stressed ‘sser’               279      236       243       7             .170

Note:
p-values below .05 (total duration of utterance, VOT [p], and duration of stressed
‘pa’) indicate significance.
The * denotes areas where learners actually regressed, i.e., did not improve.
** Such high F2-F1 values for [u] originate from the presence of [t] in the
immediate vicinity of [u]; the locus of [t] (1800 Hz) distorts the LPC readings and
raises the formant values traditionally accepted for [u] in a neutral context (885-
945 Hz).

Out of the eleven acoustic variables investigated, a statistically significant difference
was detected for three temporal variables: total duration of utterance (p = .0012),
Voice Onset Time (VOT) of [p] (p = .0008), and syllable duration of stressed [pa] in
pas (p = .0146). On these three parameters, the 20 NNSs performed better in the
posttest, i.e., approximated the NS target after the formal training. The effect of
training is thus detected by significant changes in these three acoustic parameters.

The Quantification of Visipitch II (VP) vs. Non-Visipitch (NVP)

This part of the experiment was designed to examine the effect of
technology, that is, to what extent the pronunciation performance of the ten
NNSs who used the Visipitch II (VP) software differs from that of the ten NNSs
who did not use it (NVP). The VP group was expected to show a greater change
than the NVP group in the temporal parameters and in pitch variation. The effect of
technology was investigated for each of the eleven selected acoustic parameters.

Statistical significance was detected for only one variable: the fundamental
frequency variation of the stressed syllable ‘-sser’ [se] in glisser (p = .010) (Table
4.2). Although it does not meet the set significance level, the p-value of F2 - F1
for [u] (p = .075) suggests that the NVP group approximated the target NS values
of the [u] phonetic space by lowering its F2 - F1 values, whereas the VP group
regressed between pre- and posttest. Conversely,
the VP group performed better than the NVP group in approximating the NS
target by raising the pitch on the syllable ‘-sser’ [se] to mark sentence
continuity.

No results were available for the duration of pas [pa] or for [pa] fundamental
frequency variation because the variances of the two independent groups
were unequal.

Table 4.2

Approximation by NNS-Visipitch vs. NNS-Non-Visipitch to the NS Target in
Terms of Means, Difference Between Pre- and Post Means (*), and P-Values

Acoustic Parameter            NS       10 VP         Diff.   10 Non-VP     Diff.   Pr>F
                              Target   Pre / Post    (*)     Pre / Post    (*)
                                       Means                 Means

Formants in Hertz (Hz)
F2-F1 [y]                     1721     1728 / 1849   121     1841 / 1777   -64     .200
F2-F1 [u]                     1218     1545 / 1811   266     1992 / 1921   -71     .075
F1 [a]                        732      613 / 626     13      562 / 617     52      .570

Duration in milliseconds (msec.)
Total Duration of Utterance   1408     2362 / 1918   -444    2128 / 1816   -312    .535
VOT [t]                       22       43 / 35       -8      46 / 48       2       .206
VOT [p]                       11       25 / 18       -7      32 / 22       -10     .629
[-stress] ‘tten’              127      171 / 159     -12     162 / 158     -4      .486

Fundamental Frequency Variations: Unstressed vs. Stressed Syllables in Hertz (Hz)
[-stress] ‘tten’              258      247 / 241     -6      247 / 244     -3      .750
[+stress] ‘sser’              279      229 / 246     17      244 / 239     -5      .010

Note: * refers to a significant difference between the means obtained from the
pre- and posttests.


The  Quantification  of  Long-term  Retention 

This third experimental question was designed to demonstrate to what
extent the effects of formal training in pronunciation are retained after the end of
the training. The posttest 1 production of nine NNSs was
compared to a third production (posttest 2) performed three months after the end
of the course. It was hypothesized that long-term retention could be acoustically
quantified (p < .05).

The differences between the posttest 1 and posttest 2 means showed that
the nine NNSs approximated the NS target when progress was measured in
terms of the following cues: all the temporal values, F2-F1 [y], F1 of unstressed
[a], and fundamental frequency of “pas” and “-tten”. Conversely, F2 - F1 [u]
values deviated further from the target in long-term production. A statistically
significant difference was found for total duration of utterance (p = .014) and for the
syllable duration of unstressed ‘tten’ (p = .009) (Table 4.3).


Table 4.3

Long-term Approximation by Nine NNSs to the NS Target: Means of Posttest 1
and 2, Difference Between the Means, and P-Values.

Acoustic Parameter            NS       9 NNS     9 NNS     Difference       Pr>F
                              Target   Post 1    Post 2    (Post 2 - Post 1)
                                       Mean      Mean

Formants in Hertz (Hz)
F2-F1 [y]                     1721     1835      1802      -33              .708
F2-F1 [u]                     1218     1750      1820      70               .428
F1 [a]                        732      622       642       20               .59

Duration in milliseconds (msec.)
Total Duration of Utterance   1408     1838      1745      -93              .014
VOT [t]                       22       30        21        -9.1             .14
VOT [p]                       11       13        7         -6               .13
[-stress] ‘tten’              127      145.5     135       -10.5            .009
[+stress] ‘pa’                104      146.5     141       -5.5             .75

Fundamental Frequency Variations: Unstressed vs. Stressed Syllables in Hertz (Hz)
[-stress] ‘tten’              258      245       234       -11              .18
[+stress] ‘pa’                254      242       250.3     8.3              .59
[+stress] ‘sser’              279      237.8     237.9     .1               .99

Note: p-values below .05 (total duration of utterance and duration of unstressed
‘tten’) indicate significance.


Correlation  Between  Acoustic  Measures  and  Perceptual  Ratings 
The experimental question as to the relationship between acoustic
parameters and perceptual ratings was designed to examine whether the
selected acoustic cues across the different speech tests (F2 - F1 frequency of
[y] and [u], total duration of utterance, Voice Onset Time (VOT), syllable duration,
fundamental frequency variation, and F1 frequency of unstressed [a]) correlated
with the auditory-perceptual ratings.

The  six  French  native  listeners  rated  the  native  and  non-native 
productions  of  both  the  one-syllable  word  tu/tout  [ty,  tu]  ‘you/everything’  and  the 
phrase  fais  attention  de  ne  pas  glisser  ‘be  careful  not  to  slip’  according  to 
perceived  levels  or  categories  of  accent.  The  segments  oui/suit  had  to  be 
discarded  due  to  the  heterogeneity  of  their  respective  linguistic  environments 
(See  the  introduction  to  this  chapter). 

The total number of speech samples per set was 75: 69 non-native
speakers + 6 French native speakers (from the normative database). Each
segment and the phrase were produced 69 times: 20 NNSs x 2 (pre-, posttests) +
10 NNSs Control x 2 (pre-, posttests) + 9 NNSs (post-longitudinal test). The three
sets (two segments and one phrase) of 69 speech samples were randomized in
order to avoid biased rating: if the speakers’ samples had been systematically
presented to the listener in the pre/post order, the raters might have
unconsciously ‘heard’ an improvement, whether actually present or not. After
every ten NNS samples, the production of a native speaker was inserted to
recalibrate the rater’s template. The raters had no prior information about this
procedure; however, in the introduction to their task, they were informed that the
stimuli they were about to rate contained both native and non-native samples.

Each listener rated each of the 75 samples of the one-syllable words tu and tout on a
3-point scale, i.e., categories of accent: native (C1), mild (C2), or strong (C3). The
phrase was rated on a four-point scale: native (C1), mild (C2), moderate (C3), or
strong foreign accent (C4). The differentiation between the 3-point and 4-point
scales had been determined during the pilot study. The acoustic measurements
of the individual speech samples were logged into their respective categories of
perceived accent based on 67% rater agreement, i.e., when four out of six
raters had selected the same level or category of accent (See Appendix D).
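
The 67% agreement rule (at least four of six raters choosing the same category) can be expressed as a small function. This is a sketch with a hypothetical name and hypothetical ratings. Samples without a consensus return `None`, which presumably accounts for the varying N reported for each stimulus. Because four votes out of six can go to only one category, ties cannot occur at this threshold.

```python
from collections import Counter
from typing import Optional


def consensus_category(ratings: list[str], min_agree: int = 4) -> Optional[str]:
    """Return the accent category chosen by at least `min_agree` raters, else None."""
    category, count = Counter(ratings).most_common(1)[0]
    return category if count >= min_agree else None


print(consensus_category(["C1", "C1", "C1", "C1", "C2", "C3"]))  # C1
print(consensus_category(["C1", "C1", "C2", "C2", "C3", "C3"]))  # None
```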

The number of samples (N) varied: for the one-syllable word tu, N = 37, and
for tout, N = 60; for all the acoustic variables examined for the phrase, N = 49. To
measure the strength of the relationship between the acoustic parameters and
the perceptual ratings, a correlation analysis was completed. A
statistically significant effect was found for ten of the eleven acoustic variables
selected as potential carriers of foreign accentedness.

The  results  are  presented  and  organized  according  to  three  sets  of 
acoustic  parameters:  (1)  formant  frequency  variation,  (2)  duration  or  temporal 
cues,  and  (3)  fundamental  frequency  variation. 

Formants 

The correlation coefficient r indicates a strong correlation between
acoustic cues and perceptual ratings for F2-F1 of both [y] in tu ‘you’ and [u] in
tout ‘everything’ (Tables 4.4-4.5 and Figures 4.1-4.2).

Table 4.4

Categories of Accent, Acoustic Range (Minimum-Maximum), and Means for F2 - F1
of [y]

                Min [y]   Max [y]   Mean [y]
C1-native       1824      2354      2026
C2-mild         1464      2188      1813
C3-strong        841      1625      1398

[Chart: Formant 2 - Formant 1 of [y]; minimum, maximum, and mean values plotted against categories of accent C1-C3; r = .749]

Figure 4.1

Categories of Accent, Acoustic Range, and Means for F2-F1 of [y]
Number of Samples (N) = 37

Table 4.5

Categories of Accent, Acoustic Range, and Means for F2-F1 of [u] (Hz)

Category      Min [u]   Max [u]   Mean [u]
C1-native     733       1592      1196
C2-mild       1473      1898      1696
C3-strong     1718      2313      2033

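Note: The relatively high values for [u] are related to the influence of the preceding dental-alveolar [t] (locus at 1800 Hz).

[Figure 4.2 plot: Formant 2 - Formant 1 of [u] (Hz) by category of accent (C1-C3), showing minimum, maximum, and mean; r = .889]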


Figure  4.2 

Categories  of  Accent,  Acoustic  Range  and  Means  for  F2-F1  of  [u] 
Number  of  Samples  (N)  = 60 

The first set of tables and figures displays an inverse relationship between F2-F1 of [y] and degrees of perceived accent: as F2-F1 of [y] decreases, the level of foreign accentedness increases. Conversely, there is a direct relationship between F2-F1 of [u] and degrees of perceived accent: as F2-F1 of [u] increases, so does the degree or category of perceived accent.

The correlation coefficient for the Formant 1 frequencies of unstressed [a] in attention (r = .38) indicates that as the mean F1 values decrease, the degree of perceived foreign accent increases. The relationship is inverse.


Table 4.6

Categories of Accent, Acoustic Range, and Means for F1 of Unstressed [a] in attention (Hz)

              C1     C2     C3     C4
Minimum       613    513    463    315
Maximum       734    998    905    941
Means         681    717    692    530

[Figure 4.3 plot: Formant 1 of unstressed [a] (Hz) by category of accent (C1-C4), showing minimum, maximum, and means; r = .385]


Figure 4.3

Categories of Accent, Acoustic Range, and Means for F1 of Unstressed [a]

The means indicate a lack of clear distinction between the first three categories: native, mild, and moderate. However, the distinction between these three degrees of perceived accent and the fourth category (strong foreign accent) is unequivocal.

Temporal  Acoustic  Cues 

The coefficient r is strong for all temporal cues, ranging from r = .62 to r = .67. Tables 4.7-4.10 display the distribution of the means for each acoustic parameter. The means were logged according to categories of accent: C1-native, C2-mild, C3-moderate, and C4-strong. Figures 4.4-4.7 illustrate the relationship for each variable. For each acoustic cue, the number of observations (N) is 49.


Table 4.7

Categories of Accent: Acoustic Range, Means, and Ratios for Total Duration of Utterance of Fais attention de ne pas glisser 'Careful not to slip' (s)

                 C1       C2       C3       C4
Minimum          1.373    1.431    1.553    1.839
Maximum          1.985    2.264    2.219    2.91
Means            1.55     1.648    1.84     2.271
Ratio of Means   -        1.06     1.19     1.47

[Figure 4.4 plot: Total Duration of Utterance (s) by category of accent (C1-C4); r = .651]


Figure  4.4 

Total  Duration  of  Utterance:  Categories  of  Accent  Based  on  Acoustic  Range  and 
Means. 


Table 4.8

Categories of Accent, Acoustic Range, and Means of VOT [p] in pas (ms)

                 C1     C2     C3     C4
Min.             6      5      8      11
Max.             11     27     43     65
Means            8      13     17     35
Ratio of Means   -      1.63   2.13   4.38

[Figure 4.5 plot: VOT of [p] in pas (ms) by category of accent (C1-C4); r = .678]


Figure  4.5 

VOT  of  [p]  in  pas  ‘not’:  Categories  of  Accent  Based  on  Acoustic  Range  and 
Means. 

Table 4.9

Categories of Accent: Duration Range, Means, and Ratios of Unstressed Syllable "-tten" in attention (ms)

                 C1     C2     C3     C4
Minimum          126    108    131    141
Maximum          137    169    170    207
Means            132    132    152    172
Ratio of Means   -      1      1.15   1.3

[Figure 4.6 plot: Syllable Duration of '-tten' (ms) by category of accent (C1-C4); r = .645]


Figure  4.6 

Syllable  Duration  of  “-tten”:  Categories  of  Accent  Based  on  Acoustic  Range  and 
Means. 

Table 4.10

Categories of Accent: Duration Range, Means, and Ratios of Stressed Syllable "pas" in de ne pas glisser 'not to slip' (ms)

                 C1     C2     C3     C4
Minimum          98     104    77     123
Maximum          106    185    182    202
Means            102    127    136    156
Ratio of Means   -      1.2    1.3    1.5

[Figure 4.7 plot: Syllable Duration of [pa] (ms) by category of accent (C1-C4), showing maximum and means; r = .623]


Figure  4.7 

Syllable  Duration  of  “pas”:  Categories  of  Accent  Based  on  Range  and  Means. 


In each case, the correlation coefficient measures the strength of the linear association between the temporal parameter and perceived degrees of foreign accent. The relationship is direct: as the temporal values increase, so does the degree of perceived accent.

Fundamental  Frequency  Variation 

The correlation coefficients r reported for unstressed '-tten' in attention, stressed 'pas', and stressed '-sser' in glisser were .39, .31, and .45, respectively. These coefficients indicate a relationship between degrees of accent and the F0 peaks and valleys that mark [+stress] vs. [-stress] syllables relative to their adjacent syllables.

Table 4.11

Fundamental Frequency Variations (Hz) Characterizing the Alternation between [+stress] and [-stress] Syllables and Categories of Foreign Accent

              Fais a   tten   tion   de ne   pas   gli   sser
C1-native     280      241    293    191     261   214   273
C2/3-mild     264      242    260    204     244   220   245
C4-strong     257      239    246    204     239   235   233

Note: To facilitate the reading of the table and matching graph, the two intermediate levels of foreign accent, Category 2/mild and Category 3/moderate, have been averaged.


[Figure 4.8 plot: Phrasal Prosodic Pattern and Categories of Accent; y-axis: fundamental frequency (Hz), 180-300; x-axis: stressed vs. unstressed syllables]


Figure 4.8

Prosodic Pattern of the Phrase Fais attention de ne pas glisser and Categories of Accent Based on Fundamental Frequency Variations.

'-tten' r = .3865; 'pas' r = .3171; '-sser' r = .4515


Considering the angle that characterizes the valley of unstressed [ta] "-tten" relative to the adjacent syllables "Fais a-" and "tion" in Fais attention, a widening of this angle, i.e., a flattening effect, correlates with an increase in the perceived degree of foreign-accented pronunciation. Similarly, considering the angle that characterizes the peak of stressed [pa] in de ne pas glisser relative to the unstressed adjacent syllables, a widening of this angle correlates with an increase in English-accented pronunciation of French. The stressed syllable "-sser", marked as the end of a rhythmic group, shows the same pattern: a decrease in fundamental frequency correlates with an increase in the level of perceived foreign accent (Table 4.11 and Figure 4.8).
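
Because Table 4.11 gives F0 values but no timing, the angle itself cannot be recomputed from it. As a hypothetical stand-in (not the measure used in the study), the sketch below takes a syllable's F0 minus the mean F0 of its two neighbours: positive values mark a peak, negative values a valley, and values nearer zero correspond to a flatter, wider angle. The names `prominence` and `F0_TABLE` are illustrative only; the F0 values are those of Table 4.11.

```python
from typing import Dict, List

SYLLABLES = ["Fais a", "tten", "tion", "de ne", "pas", "gli", "sser"]

# Mean F0 (Hz) per syllable from Table 4.11 (C2 and C3 averaged in the source).
F0_TABLE: Dict[str, List[float]] = {
    "C1-native": [280, 241, 293, 191, 261, 214, 273],
    "C2/3-mild": [264, 242, 260, 204, 244, 220, 245],
    "C4-strong": [257, 239, 246, 204, 239, 235, 233],
}


def prominence(f0: List[float], i: int) -> float:
    """F0 of syllable i minus the mean F0 of its two neighbours (Hz).

    Hypothetical proxy for the peak/valley "angle": the larger the absolute
    value, the sharper the peak or valley.
    """
    if not 0 < i < len(f0) - 1:
        raise ValueError("syllable needs a neighbour on each side")
    neighbours = (f0[i - 1] + f0[i + 1]) / 2
    return f0[i] - neighbours


for category, f0 in F0_TABLE.items():
    tten = prominence(f0, SYLLABLES.index("tten"))
    pas = prominence(f0, SYLLABLES.index("pas"))
    print(f"{category}: -tten {tten:+.1f} Hz, pas {pas:+.1f} Hz")
# C1-native: -tten -45.5 Hz, pas +58.5 Hz
# C2/3-mild: -tten -20.0 Hz, pas +32.0 Hz
# C4-strong: -tten -12.5 Hz, pas +19.5 Hz
```

On this proxy, both the valley of "-tten" and the peak of "pas" shrink toward zero as perceived accent increases, in line with the flattening described above. The group-final "-sser" has only one neighbour, so it is left out of this sketch.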


CHAPTER  5 
DISCUSSION 

The  results  of  this  study  suggest  that  of  the  eleven  acoustic  parameters 
analyzed  to  determine  which  of  them  were  carriers  of  foreign  accent,  ten  were 
associated with perceived degrees of accent. A strong correlation was found between perceived degrees of accent and the following acoustic parameters: F2-F1 of both [y] and [u], total duration of utterance (TDU), voice onset time (VOT) of syllable-initial [p], and syllable duration and F0 as markers of stressed vs. unstressed syllables. The means and mean differences between groups also indicated that the temporal acoustic parameters, F2-F1 of [u], and F0 of the final stressed syllable constitute appropriate measures to quantify progress over pronunciation training, the effect of training on the Visipitch II, and long-term retention.

The  findings  of  this  study  will  be  discussed  in  order  of  the  experimental 
questions.  First,  can  improvement  in  pronunciation  over  formal  training-time  be 
acoustically  quantified  relative  to  the  native  (NS)  target  baseline?  Second,  can 
the  effect  of  technology  (Visipitch  II)  in  formal  pronunciation  training  be 
acoustically  quantified?  Third,  can  long-term  retention  in  foreign  pronunciation  be 
quantified? And fourth, do selected acoustic parameters correlate with auditory-perceptual ratings? For each question, the discussion will include a comparison to relevant studies previously mentioned in the literature review, interpretations of and extrapolations from the statistical results (both descriptive and inferential statistics), and implications and suggestions for further research.

The  Quantification  of  Formal  Training 
Progress in pronunciation is contingent upon factors such as age of learning, length of residence, and exposure to and use of the target language (Celce-Murcia et al., 1997; Ellis, 1994; Flege, 1995). The validity of testing pronunciation and oral skills has been questioned because, in natural running speech, speakers (native and non-native) very rarely produce an identical utterance twice. In other words, the same conceptual message can be linguistically produced with different words, syntax, stress patterns, and/or pitch.

Consequently, assessing change by comparing pretest and posttest productions has been claimed to be invalid. Furthermore, empirical measurement of progress in pronunciation over time in the formal setting, especially for prosodic patterns, is needed.

This study was designed to quantify pronunciation progress in the classroom at both the segmental and suprasegmental levels. It provides evidence that change over training time, be it progress, regression, or zero change, can be objectively measured. The results provide significant evidence of this change for three acoustic parameters: total duration of utterance, VOT of syllable-initial [p], and syllable duration of stressed [pa]. In addition, the means and the differences between pre- and posttest means indicate (1) NNS improvement relative to the target NS in eight of the eleven acoustic parameters, but (2) a lack of improvement for F2-F1 of both [y] and [u] and for the fundamental frequency of the stressed syllable "pas". The 20 NNSs raised their F1 values for [a], reduced duration for all five temporal cues, raised the average fundamental frequency on stressed [se], and reduced it on unstressed [ta]. Conversely, the means and differences of means indicate a further deviation from the NS target for F2-F1 of [y] and [u] and for the fundamental frequency of stressed [pa]. Statistical significance for these variables might differ with a larger sample size, different cues or measures, or an altered design of the elicitation task. For example, in the case of [y] and [u], the proximity of [t] led some speakers to shape the oral cavity to the point of altering the quality of the vowels, which in turn altered the acoustic measures.

Furthermore, the pre- and posttest variations exhibited by the ten experienced French NNSs of the control group, for whom no training was expected to produce no effect, suggest that changes over time may not necessarily or exclusively result from training (See Appendix E). Rather, these changes may result from mere variation in natural speech output. This observation supports claims in Second Language Acquisition research that pre- and posttesting of pronunciation is questionable, given the many extraneous variables involved in speech production, in either the native or the target language.

In summary, the results of this study suggest that the temporal acoustic cues (total duration of utterance, VOT of [p], and [+stress] syllable duration) constitute reliable, objective acoustic parameters that can be measured to quantify progress over training time in pronunciation in the classroom. However, for the other acoustic parameters, generalization is questionable due to the lack of statistical significance. Consequently, at this point, auditory-perceptual assessment constitutes a more practical method to evaluate pronunciation progress in the classroom.

The Quantification of Visipitch (VP) vs. Non-Visipitch Training (NVP)
Previous research on the effect of software on the learning of the sound system of a foreign language shows that practice with programs such as the Visipitch (Kay Elemetrics) is as helpful and effective as private tutoring, and more effective than the traditional audio-tape condition (Landhal & Ziolkowski, 1995). The audio-visual feedback provided by the Visipitch raises learners' awareness, guides them to monitor their own speech, and helps them "lose their self-consciousness about pronunciation because they become fascinated with the visual display" (Andersen-Shieh, 1992). In the works consulted, the studies involved Korean, Japanese, and Mandarin Chinese students acquiring English sound patterns. The speech samples were essentially minimal pairs. The training time ranged from 20 minutes per phonological contrast to a total of eight hours of practice. However, none of the studies compared VisiPitch (VP) vs. Non-VisiPitch (NVP) groups of students learning French over a semester course and producing segments as well as phrases and/or complete sentences. The current study has expanded the language realm to French, has increased the size of the speech samples from minimal pairs to natural connected speech, and has delineated the experiment within the classroom timeframe.

The statistical findings of this portion of the study suggest that the VP group performed better than the NVP group in one particular instance of stress assignment: [+stress] "-sser", marked by a sharp increase in fundamental frequency to indicate continuity in the sentence. Conversely, the VP group regressed relative to the NVP group in the production of [u], while the NVP values for F1 of [a] show a closer approximation to the NS values than those of the VP group. The lack of improvement by the VP speakers may stem from the fact that this piece of software primarily focuses on voiced-voiceless contrasts, pitch, and duration patterns, not on vowel quality. The differences between means indicate that the VP group approximated the NS target better than the NVP subjects in four out of five temporal cues and in the fundamental frequency cue for the syllable "-tten". However, these results cannot be generalized beyond the present linguistic context due to the lack of statistical significance.

In summary, the results of this part of the experiment suggest that there is a significant difference between VP and non-VP training in two out of eleven variables: F2-F1 of [u] and fundamental frequency variation of [+stress] "-sser". Further research should focus on increasing the number of subjects, as well as the number of practice hours with the VP. Indeed, the VP training may have been too short (less than three hours over the 16-week semester). Due to university regulations, an increase in VP training time was not permitted, since this might have compromised equality of teaching between the two groups during a normal semester.

The  Quantification  of  Long-term  Retention 

The literature highlights the fact that very few studies have examined long-term retention in the formal training environment (Ellis, 1994). The present study attempted to answer this need by carrying out a longitudinal study in the classroom setting, although, admittedly, problems were experienced in gathering subjects and administering the test (posttest 2) three months after the class on French pronunciation had ended.

The findings indicated a significant difference between posttest 1 and posttest 2 (long-term retention) in two specific temporal cues: total duration of utterance and duration of the [-stress] syllable "-tten". The lack of statistical significance for the remaining acoustic variables may result from the fact that the posttest 1 (end-of-course) values were already very close to the NS target compared to the pretest values (Table 4.1). In other words, it is possible that the nine NNS students had already learned how to monitor some features of French pronunciation by the end of the course. If this was the case, the retention or maintenance level would not differ significantly from the posttest. However, the differences between posttest 1 and posttest 2 means indicate that, except for regression in the production of French [u], the long-term group improved in all the remaining variables (Table 4.3).

The results of this section of the study suggest that long-term retention can be quantified. This finding supports claims in Second Language Acquisition research that the acquisition of an L2 sound system is indeed attainable, regardless of the type of feedback. In future research, further statistical tests could be run; for example, instead of comparing posttest 1 with posttest 2 results, an analysis including the pretest values for these nine NNS learners may generate stronger evidence of long-term retention beyond the classroom course.

Correlation  Between  Acoustic  Measures  and  Perceptual  Ratings 
Several studies in Speech-Language Pathology and Second Language Acquisition have highlighted the problems that listeners encounter while performing auditory-perceptual ratings, such as the natural inclination to listen holistically, the subjectivity of individual perceptual templates, the validity/reliability criteria of scale resolution, and inter-rater agreement (Flege, 1995; Kent, 1996; Kreiman et al., 1993, 1998).

The present research encountered similar problems. The six native French raters underscored the artificiality of assessing segments cut off from the rest of the utterance. They complained that they could 'hear' sounds that were actually not part of the segment they were rating; this was indeed the case for [ty], which had been segmented from Où es-tu Robert? 'Where are you, Robert?' In a sense, because of coarticulation, they could 'hear' the consonant [r] as well as the [o] intertwined with the vowel [y]. The raters also admitted feeling uncertain about their judgments and acknowledged having changed their ratings on several occasions. This confirms the problem of intra-rater variation, i.e., the production of different ratings of the same sample at different times (Kreiman et al., 1998). The presence of NS samples, designed to reset the listeners' templates or standards to the target sounds, went unnoticed.

Concerning scales, the literature claims that rater agreement is usually higher at the endpoints of the scale than in the middle range (Kreiman et al., 1993, 1998). The results of the present study indicate a clear distinction for [u] in tout, for which 80% of the raters agreed at the endpoints of the three-point scale. For the segment [y] in tu, interrater disagreement prevailed for 51% of the utterances, and only two of the native speakers were rated in category 1, as no accent. In the phrasal assessment, 53.7% of the raters agreed at the scale endpoints. This study, therefore, supports the claim mentioned above.

More importantly, the literature underscored the need to further document the relationship between acoustic cues and auditory judgments (Kent, 1996). The findings of the present study indicate that there is an association between ten of the eleven acoustic parameters analyzed as carriers of foreign accentedness and perceived levels of accent. The Pearson Correlation Coefficient indicated a strong correlation for F2-F1 of both [y] and [u] and for the following temporal cues: total duration of utterance (TDU), VOT of [p], and [+stress]/[-stress] syllable duration. Although less striking, the coefficients for F1 of unstressed [a] and for the fundamental frequency variation of stressed and unstressed syllables also indicate a correlation with degrees of perceived foreign accent.

Formant 2 - Formant 1 of [y] and [u]

It may seem paradoxical that, on the one hand, the two vowels [y] and [u] constitute a factor of regression, as seen in the three previous experimental questions, but that, on the other hand, the statistical results indicate a strong correlation between the acoustic measure and the perceived degrees of accent. These facts actually substantiate the problems highlighted in the literature review. Both vowels are [+high], i.e., the incisor separation is small, which correlates with low F1 values. However, the two vowels differ greatly in their phonetic space, i.e., the volume of the oral cavity, due to (1) tongue position and (2) lip rounding. From [y] to [u] the high point of the tongue moves backward, lowering the F2 values (Delattre, 1966; Kent & Read, 1992). The two vowels also differ in their rounding: [y] has an exolabial rounding somewhat similar to a rounded [i], while [u] has an endolabial rounding, i.e., an inward pulling of the cheeks, which lengthens the oral cavity and lowers F2 (Catford, 1988). From an L2 acquisition viewpoint, Flege (1987, 1995) documented that English-speaking NNSs have greater difficulty producing [u], assumed to be in their sound system, than [y], which is absent from the English repertoire. The complexity of the simultaneous lip rounding and backward or forward tongue movement is represented in Table 5.1 and illustrated in Figure 5.1. The mean values of F2-F1 for [y] and [u] for the three categories of accent (native, mild, or strong) display a chiasmus: as F2-F1 [y] decreases, the degree of perceived accent increases (inverse relationship); as F2-F1 [u] decreases, the degree of perceived accent also decreases (direct relationship).

The present results indicate that the NNSs as a whole tend to produce [y] with higher F2 values, i.e., approximating [i], and tend to produce [u] approximating the French [y]. In absolute numbers, all NNS values for [y] seem to be closer to the target [y], while their efforts to approximate the target NS [u] seem to end up deviating further from the target. This would support Flege's claim (1987, 1995) that, within the present linguistic context, learners produce [u], classified as 'similar', less successfully than [y], classified as 'new'.


Table 5.1

Means of F2-F1 of [y] and [u] (Hz) and Categories of Foreign Accent

Category      Mean F2-F1 [y]   Mean F2-F1 [u]
C1-Native     2026             1196
C2-Mild       1813             1696
C3-Strong     1398             2033

[Figure 5.1 plot: Chiasmus of F2-F1 of [y] and [u]; series: Mean [y], Mean [u]]


Figure 5.1

Chiasmus - Inverse Relationship between F2-F1 of [y] vs. [u] and Degrees of Accent: C1-Native; C2-Mild; C3-Strong Foreign Accent.


From the illustration, it appears that, in the category perceived as mildly accented, the F2-F1 acoustic values for [y] and [u] nearly overlap. More research is necessary to examine the articulatory correlates of this
[y]/[u]  chiasmus.  More  specifically,  measurements  of  the  effect  of  exolabial  vs. 
endolabial  rounding,  and  the  tongue  dorsum  movements  might  clarify  the  fact 
that  lip  rounding  hides  shapes  formed  inside  the  mouth  (Catford,  1988). 

Formant 1 of Unstressed [a] in Attention

The data suggest an association between F1 of [a] and degree of accent. The weakness of the correlation may be due to the lack of clear discrimination between native, mild, and moderate degrees of accent (See Results, Table 4.6 and Figure 4.3). This may be the result of coarticulation, which distorted the acoustic readings for some of the samples. In these cases, the immediately preceding [e] of "fais a-" [fea] in fais attention, specifically when the liaison [z] was not realized, influenced the formant frequencies of [a]. This phenomenon was observed in the NSs' production of [a]; the corresponding formant frequency values then resembled those of a reduced vowel.

Fundamental Frequency Variation and Syllable Duration as Cues for Stressed-Unstressed Alternation

French rhythm has been classified as syllable-timed (trailer-timed), while English rhythm has been classified as stress-timed (leader-timed). English is said to express rhythmic patterns through dominant accentuation, whereas French is said to favor temporal organization (Delattre, 1966; Vaissiere, 1991; Wenk & Wioland, 1982). However, these are considered trends rather than rigid principles governing prosody (Anderson-Shieh, 1992). Prosodic patterns have been acoustically defined in the literature in terms of intensity, fundamental frequency rise and/or fall, and duration. The literature underscored these acoustic cues; of them, duration appeared to be consistent (Delattre, 1966). In addition, the literature indicated that the relative salience of one cue over the others is controversial.

The present study analyzed both the pitch variation and the duration across the phrase fais attention de ne pas glisser and their relationship with degrees of perceived accent. In the phrase under investigation, "pas" [pa] bears an emphatic stress, and "-sser" [se] is also stressed as the group-final syllable. The stress on these two syllables was expected to translate into higher pitch. The unstressed syllable "-tten" [ta] was predicted to show a drop in pitch relative to the adjacent ones. As for duration, "pas", as part of the negative infinitive ne pas glisser 'not to slip', was expected to be short, as was unstressed "-tten". Deviations from these standards were expected to denote the presence of a foreign accent.

The findings of the present research indicate that all the temporal acoustic parameters correlate strongly with auditory ratings: as duration increases, so does the degree of foreign accent. From a cross-linguistic viewpoint, these results do not support Munro and Derwing's (1998) claim that "slowing down may not help second language learners" (p. 159).

Concerning variations in fundamental frequency, the alternation between [+stress] and [-stress] syllables relative to their adjacent ones generates "angles" that create a rise-fall intonation pattern. The relationship between peaks and valleys denotes stressed vs. unstressed syllables relative to their adjacent syllables. The study suggests that a flattening of these angles correlates with an increase in the perceived degree of foreign accent. In other words, the experiment indicates that when the peak of a stressed syllable or the valley of an unstressed syllable loses its sharpness, this is perceived as an increase in the degree of foreign-accented pronunciation. This study further suggests that approximation to native pronunciation depends on the degree of the rise-fall pattern of fundamental frequency relative to the adjacent syllables.

When the duration and fundamental frequency patterns are considered together, an inverse relationship emerges:

• As duration increases, so does the perceived degree of foreign accentedness;

• In fundamental frequency, the narrower the angle (whether the peak of a stressed syllable or the valley of an unstressed syllable), the closer the approximation to native pronunciation.

These findings do not support theories claiming that French favors temporal organization. Rather, this research suggests that, in the present linguistic context, sharp pitch variation correlates with the native French accent, whereas monotonous pitch correlates with an English foreign accent. In other words, French prosody may be classified as syllable-timed, but it also relies on accentuation. Deviations from the native accentuation pattern appear to be perceived as foreign accentedness.

Further research on degrees of intonation "angles" is recommended. For example, the effect of speaker inhibition in the testing environment on the correlation between F0 and degrees of accent could be examined. Broader linguistic contexts could also be investigated, as well as the effect of duration and F0 as interacting parameters.
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General  Limitations  of  the  Study 

For several acoustic parameters, the average inter-group performances
did not differ significantly from each other. This may be due partly to factors such
as the saliency (or lack of saliency) of the selected acoustic parameters,
instrumentation problems, and/or coarticulation. Coarticulation, which is viewed
as a critical component in the perception of speech, was also a cause of
inconsistent measures, e.g., fais a- [fea], de ne [dœnœ]/[dœn]/[dnœ], and [ty]-[tu].
Similarly, the phrasal pitch pattern constituted a limitation. The phrase under
investigation was part of a longer sentence to be uttered in the imperative mode.
However, accounts of which pitch contour is appropriate emphasize the
complexity of the patterns due to implicit semantic messages (Valdman, 1993). In
this particular instance, the learners may not have been alerted to the intricacy of
this part of the dialogue.

Implications  and  Future  Directions 

From a pedagogical and Second Language Acquisition viewpoint, the
results of the study point to problems in monitoring the ‘similar’ vowel [u] rather
than the ‘new’ vowel [y], and highlight the importance of pitch variation as a carrier
of foreign accent in prosody (Pickering, 1999). The study also indicates that pre-
and post-testing may not be a valid means of evaluating pronunciation performance
in the classroom; this, in turn, suggests the need for alternative methods to assess
actual improvement in pronunciation.

From an empirical viewpoint, the study highlights the delicate balance to
be kept between instrumentation, number of speakers and listeners, and test
design. For replication, the construct validity of the elicitation task requires an
identical phonetic environment when contrastive phonemes are being
investigated. For instance, [y] and [u] would both be examined in sentence-final
position: en veux-tu? ‘Would you like some?’ vs. Veut-il tout? ‘Does he want it
all?’ Further research could compare (1) auditory-perceptual ratings from
teachers of the target language and (2) ratings from speakers of the same
language.

Concerning implications for the development of software applied to classroom
needs, the study provides evidence that, in the present linguistic context, there is a
correlation between acoustic cues and degrees of foreign accent. As a result, this
experiment constitutes an initial step toward potential applications in language
learning software. The ultimate objective is to define numerically the accuracy,
within a range of approximation, of digitized speech produced by learners. The
range of approximation would correspond to a category or degree of accent.
Based on acoustic values, such software would complement the existing,
built-in audio-visual biofeedback with corrective feedback. In other words, this
instrument would determine the distance between the programmed, digitized
model stimulus and the learner’s output. Providing corrective feedback would
extend the scope and methodology of the classroom to virtually any
learning environment.
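The corrective-feedback idea can be sketched in a few lines, assuming, as in Appendix E, that the distance between learner and model is expressed as the ratio of the learner's value to the model's value. This is a hypothetical illustration, not a design from the study: the names `assess`, `Feedback`, and `BANDS`, and in particular the numeric tolerance bands mapped to accent categories, are invented, since the study does not define numeric ranges for its categories.

```python
from dataclasses import dataclass

# Hypothetical tolerance bands: the maximum relative deviation from the model
# value allowed for each accent category. Illustrative assumptions only.
BANDS = [(0.10, "1: no accent"), (0.25, "2: mild accent")]
FALLBACK = "3: strong accent"


@dataclass
class Feedback:
    parameter: str
    ratio: float     # learner value divided by model value
    category: str


def assess(parameter: str, model_value: float, learner_value: float) -> Feedback:
    """Compare one acoustic measurement of a learner with the model stimulus."""
    if model_value <= 0:
        raise ValueError("model_value must be positive")
    ratio = learner_value / model_value
    deviation = abs(ratio - 1.0)  # 0.0 means the learner matches the model exactly
    for limit, label in BANDS:
        if deviation <= limit:
            return Feedback(parameter, ratio, label)
    return Feedback(parameter, ratio, FALLBACK)


# 1218 Hz is the native F2-F1 target for [u] in Appendix E; 1608 Hz is a made-up learner value.
print(assess("F2-F1 [u]", 1218, 1608))
```

With these invented thresholds, the example yields a ratio of about 1.32 and falls in the strong-accent band. A real tool would need bands calibrated against listener ratings for each acoustic parameter separately.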


APPENDIX  A 

PARTIAL  INVENTORY  OF  CROSS-LINGUISTIC  STUDIES 


APPENDIX B

PRE- AND POSTTEST DIALOGUE

The reading that speakers were asked to perform is based on the
following dialogue between Robert and Suzon:

• Où es-tu, Robert? Je ne vois rien dans le noir.
  Where are you, Robert? I can’t see anything in the dark.

• Par ici, Suzon. La porte est ouverte, mais fais attention de ne pas glisser en descendant les escaliers.
  This way, Suzon. The door is open, but be careful not to slip when you come down the steps.

• Comme il fait frais.
  How cool it is down here!

• Oui, forcément. C’est une cave!
  Of course. It’s a cellar!

• Et bien ta cave ne sent pas le vin. Je sens plutôt la paille.
  Well, I don’t smell wine in your cellar, but rather straw.

• Tu sentirais peut-être la paille dans le grenier, mais pas ici. Écoute, je n’ai que deux allumettes et cette ampoule est brûlée. Attends-moi. Je reviens tout de suite.
  You’d smell straw maybe in the attic, but not here. Listen, I only have two matches and this lightbulb is burnt out. Wait for me here. I’ll be right back.

• Qu’il fait noir! Je n’aurais jamais dû rester seule!
  How dark it is in here! I should never have stayed here alone.

The characters in bold were extracted and analyzed in the present study.
APPENDIX C

INSTRUMENTATION: ACOUSTIC SEGMENTATION

The present waveform corresponds to the utterance tout de suite
[tudœsɥit] ‘right away’. This is one of the displays provided on screen by the
CSL (Kay Elemetrics), from which acoustic analyses are performed. The second part of
the screen is time-aligned with the waveform and displays a spectrogram overlaid
with the formant history (FMT) lines. The cursor is positioned on the vowel [u]
and gives a value of 1244 Hz for the second formant. The cursor is used to
isolate and segment sounds, e.g., [u] from the following voiced consonant [d].

[Figure: CSL waveform and spectrogram of tout de suite, segmented as [t u d œ s ɥ i]]
APPENDIX D

CATEGORIZATION OF AUDITORY-PERCEPTUAL RATINGS PER ACOUSTIC CUE


This appendix provides the lists of ratings per category for the segmental
and the suprasegmental features:

• [y] and [u]

• Temporal cues, prosodic cues (stress assignment and intonation), syllable
duration, and F1 of [a].

Categorization of ratings for F1 and F2 frequencies of 'tu'

Category           % agree   Token   F1 (Hz)   F2 (Hz)   F2-F1 (Hz)
1: no accent       100%      38      264       2370      2106
                             45      286       2291      2005
                   85%       34      276       2345      2069
                             43      327       2229      1902
                   67%       5       285       2639      2354
                             31      300       2218      1918
                             68      322       2146      1824
                             75      254       2280      2026
Total: 8 stimuli

2: mild accent     100%      60      338       2348      2010
                             70      322       2251      1929
                   85%       13      334       2414      2080
                             26      336       2095      1759
                             44      347       2063      1716
                             55      356       2204      1848
                             61      310       1986      1676
                             71      347       1984      1637
                   67%       8       369       2082      1713
                             20      325       2014      1689
                             25      335       2136      1801
                             32      307       2495      2188
                             36      325       2218      1893
                             42      378       2211      1833
                             46      411       2099      1688
                             59      344       2014      1670
                             67      399       1863      1464
                             74      328       2370      2042
Total: 18 stimuli

3: strong accent   100%      10      342       1757      1415
                             63      330       1291      961
                             58      342       1787      1445
                             29      304       1772      1468
                   85%       17      319       1912      1593
                             41      284       1849      1565
                             54      294       1135      841
                             56      273       1872      1599
                             64      294       1730      1436
                   67%       12      437       2063      1626
                             37      334       1762      1428
Total: 11 stimuli

Grand Total: 37 stimuli
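The F2-F1 column is the difference between the second and first formant frequencies, a common index of vowel fronting. The sketch below recomputes it from the F1 and F2 values and averages it per rating category. For brevity it uses only the 100%-agreement rows of the 'tu' table above; the function name `mean_f2_minus_f1` is illustrative.

```python
from statistics import mean

# (category, token, F1_hz, F2_hz): the 100%-agreement rows of the 'tu' table above.
ROWS = [
    (1, 38, 264, 2370),
    (1, 45, 286, 2291),
    (2, 60, 338, 2348),
    (2, 70, 322, 2251),
    (3, 10, 342, 1757),
    (3, 63, 330, 1291),
    (3, 58, 342, 1787),
    (3, 29, 304, 1772),
]


def mean_f2_minus_f1(rows):
    """Return {category: mean F2-F1 in Hz}, computing F2-F1 for each token."""
    by_category = {}
    for category, _token, f1, f2 in rows:
        by_category.setdefault(category, []).append(f2 - f1)
    return {cat: mean(values) for cat, values in sorted(by_category.items())}


print(mean_f2_minus_f1(ROWS))  # {1: 2055.5, 2: 1969.5, 3: 1322.25}
```

In these eight rows, the mean F2-F1 drops from about 2056 Hz for category 1 to about 1322 Hz for category 3. In other words, the tokens rated as strongly accented are, on average, less front. The full table would be needed before drawing any conclusion.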

Categorization of ratings for F1 and F2 frequencies of 'tout'

Category           % agree   Token   F1 (Hz)   F2 (Hz)   F2-F1 (Hz)
C1                 100%      21      280       1491      1211
                             22      319       1450      1131
                             26      286       1444      1158
                             75      338       1528      1190
                   85%       6       254       1653      1399
                             18      336       1305      969
                             34      306       1673      1367
                             44      328       1666      1338
                             55      318       1397      1079
                   67%       2       285       1523      1238
                             9       389       1303      914
                             33      325       1318      993
                             41      269       1861      1592
                             43      442       1365      923
                             60      394       1127      733
                             66      259       1646      1387
                             74      243       1955      1712
Total: 17 stimuli

C2                 100%      16      319       2108      1789
                             20      267       1795      1528
                             29      287       1869      1582
                             40      260       1922      1662
                   85%       8       230       1943      1713
                             51      301       1964      1663
                             63      279       1764      1485
                             69      244       2019      1775
                   67%       10      266       1739      1473
                             38      275       2173      1898
                             47      254       2105      1851
                             58      283       2063      1780
                             59      374       2229      1855
Total: 13 stimuli

C3                 100%      25      239       2398      2159
                             28      279       2511      2232
                             32      295       2240      1945
                             39      277       2590      2313
                             42      349       2461      2112
                             50      277       2477      2200
                             56      276       2124      1848
                             73      292       2401      2109
                   85%       1       339       2249      1910
                             3       245       2104      1859
                             7       293       2393      2100
                             14      254       2105      1851
                             15      268       2290      2022
                             17      271       2236      1965
                             24      244       2429      2185
                             27      269       2334      2065
                             31      281       2394      2113
                             36      349       2108      1759
                             46      326       2219      1893
                             49      223       2278      2055
                             53      294       2443      2149
                             70      296       2406      2110
                   67%       4       294       2414      2120
                             12      360       2351      1991
                             30      293       2011      1718
                             35      284       2264      1980
                             52      330       2246      1916
                             64      264       2519      2255
                             68      239       2197      1958
                             72      292       2401      2109
Total: 30 stimuli

Grand Total: 60 stimuli

[Tables: categorization of ratings for total duration of utterance (TDU), VOT of [p], fundamental frequency in Hertz, syllable duration, and F1 of [a]]

APPENDIX E

CONTROL GROUP: PRE- AND POSTTEST


Pre- and posttest results of the ten non-native speakers (NNSs) experienced in
French. The NNS values in the table express the distance between the NNS values
and the target values as a ratio.


Acoustic Parameters          Target:   10 NNSs Experienced in French (Control Group)
                             15 NSs    Pretest Ratio   Posttest Ratio   Difference

Formants in Hertz (Hz)
F2-F1 [y]                    1721      1.03            1.13             -.10
F2-F1 [u]                    1218      1.32            1.25             .07
                             732       .99             .95              -.05

Duration in milliseconds (msec.)
Total duration of utterance  1.408     1.09            1.17             -.08
VOT [t]                      22        1.4             1.3              .1
VOT [p]                      11        1.2             1.3              -.1
Unstressed ‘-tten’           128       1.04            1.09             -.05
Stressed ‘pa’                104       1.04            1.06             -.02

Fundamental Frequency Variations: Unstressed vs. Stressed Syllables in Hertz (Hz)
Unstressed ‘-tten’           258       1.00            .86              -.14
Stressed ‘pa’                254       .96             .96              .00
Stressed ‘sser’              279       .79             .75              -.04
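Each ratio in the table presumably divides the NNS group value by the native target value, so 1.00 means "on target." The Difference column appears to track the change in distance from that target, |pretest - 1| - |posttest - 1|, so positive values mean the posttest moved closer to the native value. This formula is inferred from the printed numbers, which it reproduces to within rounding of the two-decimal ratios. It is not stated in the text. The function name `change_toward_target` is illustrative.

```python
def change_toward_target(pre_ratio: float, post_ratio: float) -> float:
    """Positive when the posttest ratio is closer to 1.0 (the native target) than the pretest."""
    return abs(pre_ratio - 1.0) - abs(post_ratio - 1.0)


# (parameter, pretest ratio, posttest ratio), copied from the table above.
ROWS = [
    ("F2-F1 [y]", 1.03, 1.13),
    ("F2-F1 [u]", 1.32, 1.25),
    ("VOT [t]", 1.4, 1.3),
    ("F0 unstressed '-tten'", 1.00, 0.86),
    ("F0 stressed 'sser'", 0.79, 0.75),
]

for name, pre, post in ROWS:
    print(f"{name}: {change_toward_target(pre, post):+.2f}")
```

A plain pretest-minus-posttest subtraction would give +.14 and +.04 for the two F0 rows, whereas the table prints -.14 and -.04. The distance-from-target reading is the one that matches the signs of every row.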